Thanks both for keeping us posted! It seems we are going to need another storage-api release to fix the dependency issue that was discovered recently.

@Kyle Regarding the HT dictionary regression -- it seems quite critical. Can you please point to the actual failing test? I could only find build failures in the logs. Regarding the JMH benchmarks, I am a happy user myself -- *ORCWriterBenchMark* was created for this very reason. Maybe it would make sense to extend it for nested schemas as described above? A couple of rough sketches follow below.
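For anyone else hitting the OOM in the meantime, here is a minimal sketch of the workaround Kyle describes: switching `orc.dictionary.implementation` back to the pre-1.7 default. The config key and the `rbtree` value come from this thread; the output path and one-column schema are placeholders for illustration only.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class DictionaryWorkaround {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Revert to the pre-1.7 default while the hash regression is investigated.
    conf.set("orc.dictionary.implementation", "rbtree");
    // Placeholder schema and path -- substitute your real ones.
    TypeDescription schema = TypeDescription.fromString("struct<name:string>");
    Writer writer = OrcFile.createWriter(new Path("/tmp/example.orc"),
        OrcFile.writerOptions(conf).setSchema(schema));
    writer.close();
  }
}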
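And as a starting point for a nested-schema benchmark, something along these lines could work. This is an untested sketch, not the existing benchmark code: the class name `NestedSchemaWriteBenchmark`, the schema, the row count, and the tag cardinality are all made up for illustration; only the `orc.dictionary.implementation` key and its `rbtree`/`hash` values come from this thread. A real version would mirror the 20+ column map/list schema from the failing Iceberg test.

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.ListColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.StructColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
public class NestedSchemaWriteBenchmark {

  // Compare the two dictionary implementations on the same nested data.
  @Param({"rbtree", "hash"})
  public String dictionaryImpl;

  @Benchmark
  public void writeNested() throws Exception {
    Configuration conf = new Configuration();
    conf.set("orc.dictionary.implementation", dictionaryImpl);

    // Small stand-in for the large mixed-type / nested schema.
    TypeDescription schema = TypeDescription.fromString(
        "struct<id:bigint,info:struct<name:string,tags:array<string>>>");

    Path path = new Path("/tmp/nested-bench-" + dictionaryImpl + ".orc");
    path.getFileSystem(conf).delete(path, false);

    Writer writer = OrcFile.createWriter(path,
        OrcFile.writerOptions(conf).setSchema(schema));
    VectorizedRowBatch batch = schema.createRowBatch();
    LongColumnVector id = (LongColumnVector) batch.cols[0];
    StructColumnVector info = (StructColumnVector) batch.cols[1];
    BytesColumnVector name = (BytesColumnVector) info.fields[0];
    ListColumnVector tags = (ListColumnVector) info.fields[1];
    BytesColumnVector tagValues = (BytesColumnVector) tags.child;

    for (int r = 0; r < 1000; ++r) {
      int row = batch.size++;
      id.vector[row] = r;
      byte[] n = ("user-" + r).getBytes(StandardCharsets.UTF_8);
      name.setRef(row, n, 0, n.length);
      // One tag per row; reuse a small value space to exercise the dictionary.
      tags.offsets[row] = tags.childCount;
      tags.lengths[row] = 1;
      byte[] t = ("tag-" + (r % 16)).getBytes(StandardCharsets.UTF_8);
      tagValues.setRef(tags.childCount++, t, 0, t.length);
      if (batch.size == batch.getMaxSize()) {
        writer.addRowBatch(batch);
        batch.reset();
      }
    }
    if (batch.size != 0) {
      writer.addRowBatch(batch);
    }
    writer.close();
  }
}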
Cheers,
Panagiotis

On Mon, Jul 26, 2021 at 9:21 AM Kyle Bendickson <[email protected]> wrote:

> Thanks for the summary of QA for 1.7.0, Dongjoon =)
>
> I wanted to follow up on a few things I've come across while testing several different versions of ORC 1.7.0-SNAPSHOT with Apache Iceberg, in case others have insight into the issues I've faced:
>
> 1) The NPE in the Iceberg tests for 1.7.0-SNAPSHOT did not occur with one of the earlier 1.7.0-SNAPSHOT builds. I am still working to determine which build it was working with, but here's a link [1] to the branch / PR where I am experiencing the NPE in a Flink ORC read/write test suite when using a large mixed-type / nested schema, in case others want to see the failed test suites (we use GitHub Actions, so it should be familiar).
>
> The NPE occurs when calling `close` and attempting to write the file metadata. It happens in org.apache.orc.impl.PhysicalFsWriter#writeFileMetadata, specifically when calling getPos on the private field `rawWriter` of type FSDataOutputStream, so it's possible there is a configuration issue when setting the filesystem for the writer (e.g. in the Iceberg code). However, several other tests pass -- about five in total, all of which run this exact same test code with different schemas -- so I'm inclined to think there's some sort of issue with where we read / write to the output stream that surfaces with a large number of serialized columns. I am still investigating and will report as soon as I know more, but I wanted to bring it to everyone's attention =)
>
> 2) For the same test, which reads and writes 1000 records to a file using Flink (though mostly via the Iceberg API) with a large / nested schema, we have to change the writer's configuration to use the old default value of `orc.dictionary.implementation`, i.e. `rbtree` instead of the new default value of `hash`. When using `hash`, there is an OOM coming from the relatively recently added `StringHashTableDictionary`. I noticed there is a comment [2] about not needing large buckets, as very many collisions would mean that the "table is too small or the function isn't good". This test specifically exercises large and nested schemas (there are over 20 columns with map and list types), but this data isn't necessarily far outside the realm of real-world datasets.
>
> Are there any existing performance tests using the new hash-based dictionary implementation, or any guidance we can give users on configuring their files (stripes, etc.) to make effective use of the new hash implementation?
> I'm wondering if users will have a difficult time transitioning due to having "a function that isn't good or a table that's too small".
>
> [1] GitHub PR with currently failing Iceberg Flink tests for ORC 1.7.0-SNAPSHOT:
> https://github.com/kbendick/iceberg/pull/42
> [2] ORC StringHashTableDictionary class, with hash bucket instantiation and comments about bucket size w.r.t. collisions:
> https://github.com/apache/orc/blob/main/java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java#L81-L89
>
> Thanks for summarizing the ongoing work, and please let me know if I can be of any help outside of the current investigation into the Iceberg issues. I'd love to help set up JMH benchmark testing, as I know we have had several PRs recently to optimize some code paths. It would be great to have a way to measure the impact somewhat (while recognizing that JMH isn't exactly perfect).
>
> Best,
> Kyle Bendickson
>
> On 2021/07/20 20:33:14, Dongjoon Hyun <[email protected]> wrote:
> > Thank you all for your participation.
> >
> > Here is a brief summary of the recent ORC community activity for Apache ORC 1.7.0 QA.
> >
> > 1. After branching branch-1.7, I shared an Apache Iceberg integration test failure.
> > 2. Kyle Bendickson reported the same issue. In addition, when he reverted to rbtree, he met an NPE in some cases. He is investigating it.
> > 3. David Mollitor has been working on several nice code refactorings and perf improvements. Currently, his patches have landed on the `main` branch for 1.8.0, but I'm considering backporting them to branch-1.7 if needed.
> > 4. William (Jeongseok) Hyun also found an orc-tools CNFE bug and fixed it.
> > 5. Yukihiro Okada made a patch to fix his `head` command contribution, and Panos/William reviewed and merged it.
> >
> > In addition to the above, the Apache ORC community is waiting for a new hive storage api release.
> >
> > Yesterday, Panos started the Hive Storage API 2.8.0-rc0 vote, and Pavan Lanka, Owen and I voted.
> > - https://lists.apache.org/thread.html/r28201af1d35c67c72d3e846c26c77c38d2a51437f0b1bd31ddcf12b8%40%3Cdev.hive.apache.org%3E
> >
> > Apache ORC 1.7.0 is coming! Please test it in your environments and give us your feedback.
> >
> > Best,
> > Dongjoon.
