Thanks both for keeping us posted!
Seems like we are going to need another Storage-api release to fix the
dependency issue that was discovered recently.
@Kyle Regarding the HT dictionary regression -- it seems quite critical.
Can you please point to the actual failing test? I could only find build
failures in the logs...
Regarding the JMH benchmarks, I am a happy user myself --
*ORCWriterBenchMark* was created for this very reason.
Maybe it would make sense to extend that for nested schemas as described
above?
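
Something along these lines could be a starting point. This is only a
rough sketch, not the actual *ORCWriterBenchMark* code -- the class
name, schema, and output path are all made up for illustration:

  import java.nio.charset.StandardCharsets;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
  import org.apache.hadoop.hive.ql.exec.vector.ListColumnVector;
  import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
  import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
  import org.apache.orc.OrcFile;
  import org.apache.orc.TypeDescription;
  import org.apache.orc.Writer;
  import org.openjdk.jmh.annotations.Benchmark;
  import org.openjdk.jmh.annotations.Scope;
  import org.openjdk.jmh.annotations.State;

  @State(Scope.Thread)
  public class NestedSchemaWriteBenchmark {

    // A small nested schema; a real benchmark would go much wider,
    // e.g. 20+ columns with map and list types as in Kyle's test.
    private final TypeDescription schema = TypeDescription.fromString(
        "struct<id:bigint,tags:array<string>>");

    @Benchmark
    public void writeNestedBatch() throws Exception {
      Configuration conf = new Configuration();
      Writer writer = OrcFile.createWriter(
          new Path("/tmp/nested-bench.orc"),
          OrcFile.writerOptions(conf).setSchema(schema).overwrite(true));
      VectorizedRowBatch batch = schema.createRowBatch();
      LongColumnVector ids = (LongColumnVector) batch.cols[0];
      ListColumnVector tags = (ListColumnVector) batch.cols[1];
      BytesColumnVector tagValues = (BytesColumnVector) tags.child;
      long childIndex = 0;
      // 512 rows x 2 list entries fits the default child vector size.
      for (int r = 0; r < 512; r++) {
        int row = batch.size++;
        ids.vector[row] = r;
        tags.offsets[row] = childIndex;
        tags.lengths[row] = 2;
        for (int i = 0; i < 2; i++) {
          byte[] val = ("tag-" + r + "-" + i)
              .getBytes(StandardCharsets.UTF_8);
          tagValues.setRef((int) childIndex++, val, 0, val.length);
        }
      }
      tags.childCount = (int) childIndex;
      writer.addRowBatch(batch);
      writer.close();
    }
  }

The interesting axes to sweep would be schema width and list/map
fan-out, with orc.dictionary.implementation set to hash vs. rbtree.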

Cheers,
Panagiotis


On Mon, Jul 26, 2021 at 9:21 AM Kyle Bendickson
<[email protected]> wrote:

> Thanks for the summary of QA for 1.7.0, Dongjoon =)
>
> Wanted to follow up on a few things I’ve come across testing several
> different versions of ORC 1.7.0-SNAPSHOT with Apache Iceberg, in case
> others have insight into the issues I’ve faced:
>
> 1) I wanted to mention that the NPE in the Iceberg tests for
> 1.7.0-SNAPSHOT did not occur with one of the earlier builds of
> 1.7.0-SNAPSHOT. I am still working to determine which build that was,
> but here's a link <https://github.com/kbendick/iceberg/pull/42> [1] to
> the branch / PR where I am experiencing the NPE in a Flink ORC
> read/write test suite when using a large mixed-type / nested schema, in
> case others want to see the failed test suites (we use GitHub Actions,
> so it should be familiar).
>
> The NPE occurs when calling `close` and attempting to write the file
> metadata. It is thrown from
> org.apache.orc.impl.PhysicalFsWriter#writeFileMetadata, specifically
> when calling getPos on the private field `rawWriter` of type
> FSDataOutputStream, so it's possible there is a configuration issue
> when setting the filesystem for the writer (e.g. in the Iceberg code).
> However, several other tests pass -- about 5 in total, all of which run
> this exact same test code with different schemas -- so I'm inclined to
> think there's some sort of issue with where we read / write to the
> output stream that only surfaces with a large number of serialized
> columns. I am still investigating and will report as soon as I know
> more, but wanted to bring it to everyone's attention =)
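>
> One way to rule the filesystem configuration in or out might be a
> standalone repro outside of Flink/Iceberg. This is only a sketch -- the
> path and schema here are made up -- but it exercises the same
> close()/writeFileMetadata path:
>
>   import org.apache.hadoop.conf.Configuration;
>   import org.apache.hadoop.fs.FileSystem;
>   import org.apache.hadoop.fs.Path;
>   import org.apache.orc.OrcFile;
>   import org.apache.orc.TypeDescription;
>   import org.apache.orc.Writer;
>
>   public class NpeRepro {
>     public static void main(String[] args) throws Exception {
>       Configuration conf = new Configuration();
>       Path path = new Path("file:///tmp/npe-repro.orc");
>       // Check which FileSystem impl the writer will actually get,
>       // since the NPE points at rawWriter.getPos() in PhysicalFsWriter.
>       FileSystem fs = path.getFileSystem(conf);
>       System.out.println("FileSystem impl: " + fs.getClass().getName());
>       TypeDescription schema = TypeDescription.fromString(
>           "struct<id:bigint,name:string>");
>       Writer writer = OrcFile.createWriter(path,
>           OrcFile.writerOptions(conf).fileSystem(fs).setSchema(schema));
>       // The reported NPE fires during close(), while writing the file
>       // metadata in PhysicalFsWriter#writeFileMetadata.
>       writer.close();
>     }
>   }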
>
> 2) For the same test, which reads and writes 1000 records to a file
> using Flink (though mostly via the Iceberg API) with a large / nested
> schema, we have to change the writer's configuration to use the old
> default value of `orc.dictionary.implementation`, i.e. `rbtree`,
> instead of the new default value of `hash`. When using `hash`, there is
> an OOM coming from the relatively recently added
> `StringHashTableDictionary`. I noticed there is a comment about not
> needing large buckets, as very many collisions would mean that the
> "table is too small or the function isn't good". <
> https://github.com/apache/orc/blob/main/java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java#L84-L87>[2]
> This test specifically exercises large and nested schemas (there are
> over 20 columns, with map and list types), but this data isn't
> necessarily far outside the realm of real-world datasets.
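>
> For reference, the workaround is a one-line configuration change on the
> writer side (a sketch only; `path` and `schema` stand in for our actual
> writer setup):
>
>   Configuration conf = new Configuration();
>   // Revert to the pre-1.7 default dictionary implementation.
>   conf.set("orc.dictionary.implementation", "rbtree");  // new default: "hash"
>   Writer writer = OrcFile.createWriter(path,
>       OrcFile.writerOptions(conf).setSchema(schema));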
>
> Are there any existing performance tests using the new hash-based
> dictionary implementation, or any guidance we can give users for
> configuring their files (stripes, etc.) to make effective use of the
> new hash implementation? I'm wondering if users will have a difficult
> time transitioning due to having "a function that isn't good or a table
> that's too small".
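>
> To make the sizing concern concrete, here's a back-of-envelope sketch
> (not ORC code; the numbers are invented): with a fixed bucket count,
> the average chain length grows linearly with the number of distinct
> keys, which is exactly the "table is too small" case that comment
> warns about.
>
>   int buckets = 10_000;            // assumed fixed hash table size
>   int distinctKeys = 1_000_000;    // many unique strings across columns
>   double avgChain = (double) distinctKeys / buckets;
>   System.out.printf("~%.0f entries per bucket on average%n", avgChain);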
>
> [1] GitHub PR with currently failing Iceberg Flink tests for ORC
> 1.7.0-SNAPSHOT: https://github.com/kbendick/iceberg/pull/42
> [2] ORC StringHashTableDictionary class, with hash bucket instantiation
> and comments about bucket size w.r.t. collisions:
> https://github.com/apache/orc/blob/main/java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java#L81-L89
>
> Thanks for summarizing the ongoing work, and please let me know if I
> can be of any help outside of the current investigation into the
> Iceberg issues. I'd love to help set up JMH benchmark testing, as I
> know we have had several PRs recently to optimize some code paths. It
> would be great to have a way to measure their impact, while recognizing
> that JMH isn't a perfect tool.
>
> Best,
> Kyle Bendickson
>
>
> On 2021/07/20 20:33:14, Dongjoon Hyun <[email protected]> wrote:
> > Thank you all for your participation.
> >
> > Here is a brief summary of the recent ORC community activity for
> > Apache ORC 1.7.0 QA.
> >
> > 1. After branching branch-1.7, I shared an Apache Iceberg integration
> > test failure.
> > 2. Kyle Bendickson reported the same issue. In addition, when he
> > reverted to rbtree, he met an NPE in some cases. He is investigating it.
> > 3. David Mollitor has been working on several nice code refactorings
> > and perf improvements. Currently, his patches have landed on the
> > `main` branch for 1.8.0, but I'm considering backporting them to
> > branch-1.7 if needed.
> > 4. William (Jeongseok) Hyun also found an orc-tools CNFE bug and
> > fixed it.
> > 5. Yukihiro Okada made a patch to fix his `head` command contribution,
> > and Panos/William reviewed and merged it.
> >
> > In addition to the above, the Apache ORC community is waiting for a
> > new hive storage-api release.
> >
> > Yesterday, Panos started the Hive Storage API 2.8.0-rc0 vote, and
> > Pavan Lanka, Owen and I voted.
> > -
> > https://lists.apache.org/thread.html/r28201af1d35c67c72d3e846c26c77c38d2a51437f0b1bd31ddcf12b8%40%3Cdev.hive.apache.org%3E
> >
> > Apache ORC 1.7.0 is coming! Please test it in your environments and
> > give us your feedback.
> >
> > Best,
> > Dongjoon.
