(Sorry, mistyped :) 1 + 0 < 1 returns false, of course, causing the check <https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/fallback/FallbackValuesWriter.java#L75> to "!initialWriter.isCompressionSatisfying(rawDataByteSize, bytes.size())" to return true and fallback to occur.)
- Claire On Wed, Sep 20, 2023 at 8:28 AM Claire McGinty <claire.d.mcgi...@gmail.com> wrote: > Hey Gang, > > Thanks for the followup! I see what you're saying where it's sometimes > just bad luck with what ends up in the first page. The intuition seems like > a larger page size should produce a better encoding in this case... I > updated my branch > <https://github.com/apache/parquet-mr/compare/master...clairemcginty:parquet-mr:dict-size-repro?expand=1> > to > add a test with a page size/dict page size of 10MB and am seeing the same > failure, though. > > Something seems kind of odd actually -- when I stepped through the test I > added w/ debugger, it falls back after invoking isCompressionSatisfying > with encodedSize == 1, dictionaryByteSize == 0 and rawSize == 0; 1 + 0 < 1 > returns true. (You can also see this in the System.out logs I added, in the > branch's GHA run logs). This doesn't seem right to me -- does > isCompressionSatsifying need an extra check to make sure the > dictionary isn't empty? > > Also, thanks, Aaron! I got into this while running some micro-benchmarks > on Parquet reads when various dictionary/bloom filter/encoding options are > configured. Happy to share out when I'm done. > > Best, > Claire > > On Tue, Sep 19, 2023 at 9:06 PM Gang Wu <ust...@gmail.com> wrote: > >> Thanks for the investigation! >> >> I think the check below makes sense for a single page: >> @Override >> public boolean isCompressionSatisfying(long rawSize, long encodedSize) { >> return (encodedSize + dictionaryByteSize) < rawSize; >> } >> >> The problem is that the fallback check is only performed on the first >> page. >> In the first page, all values in that page may be distinct, so it will >> unlikely >> pass the isCompressionSatisfying check. >> >> Best, >> Gang >> >> >> On Wed, Sep 20, 2023 at 5:04 AM Aaron Niskode-Dossett >> <aniskodedoss...@etsy.com.invalid> wrote: >> >> > Claire, thank you for your research and examples on this topic, I've >> > learned a lot. My hunch is that your change would be a good one, but >> I'm >> > not an expert (and more to the point, not a committer). I'm looking >> > forward to learning more as this discussion continues. >> > >> > Thank you again, Aaron >> > >> > On Tue, Sep 19, 2023 at 2:48 PM Claire McGinty < >> claire.d.mcgi...@gmail.com >> > > >> > wrote: >> > >> > > I created a quick branch >> > > < >> > > >> > >> https://github.com/apache/parquet-mr/compare/master...clairemcginty:parquet-mr:dict-size-repro?expand=1 >> > > > >> > > to reproduce what I'm seeing -- the test shows that an Int column with >> > > cardinality 100 successfully results in a dict encoding, but an int >> > column >> > > with cardinality 10,000 falls back and doesn't create a dict encoding. >> > This >> > > seems like a low threshold given the 1MB dictionary page size, so I >> just >> > > wanted to check whether this is expected or not :) >> > > >> > > Best, >> > > Claire >> > > >> > > On Tue, Sep 19, 2023 at 9:35 AM Claire McGinty < >> > claire.d.mcgi...@gmail.com >> > > > >> > > wrote: >> > > >> > > > Hi, just wanted to follow up on this! >> > > > >> > > > I ran a debugger to find out why my column wasn't ending up with a >> > > > dictionary encoding and it turns out that even though >> > > > DictionaryValuesWriter#shouldFallback() >> > > > < >> > > >> > >> https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L117 >> > > > >> > > > always returned false (dictionaryByteSize was always less than my >> > > > configured page size), >> DictionaryValuesWriter#isCompressionSatisfying >> > > > < >> > > >> > >> https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L125 >> > > >> > > was >> > > > what was causing Parquet to switch >> > > > < >> > > >> > >> https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/fallback/FallbackValuesWriter.java#L75 >> > > > >> > > > back to the fallback, non-dict writer. >> > > > >> > > > From what I can tell, this check compares the total byte size of >> > > > *every* element with the byte size of each *distinct* element as a >> kind >> > > of >> > > > proxy for encoding efficiency.... however, it seems strange that >> this >> > > check >> > > > can cause the writer to fall back even if the total encoded dict >> size >> > is >> > > > far below the configured dictionary page size. Out of curiosity, I >> > > modified >> > > > DictionaryValuesWriter#isCompressionSatisfying >> > > > < >> > > >> > >> https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L125 >> > > >> > > to >> > > > also check whether total byte size was less than dictionary max size >> > and >> > > > re-ran my Parquet write with a local snapshot, and my file size >> dropped >> > > 50%. >> > > > >> > > > Best, >> > > > Claire >> > > > >> > > > On Mon, Sep 18, 2023 at 9:16 AM Claire McGinty < >> > > claire.d.mcgi...@gmail.com> >> > > > wrote: >> > > > >> > > >> Oh, interesting! I'm setting it via the >> > > >> ParquetWriter#withDictionaryPageSize method, and I do see the >> overall >> > > file >> > > >> size increasing when I bump the value. I'll look into it a bit >> more -- >> > > it >> > > >> would be helpful for some cases where the # unique values in a >> column >> > is >> > > >> just over the size limit. >> > > >> >> > > >> - Claire >> > > >> >> > > >> On Fri, Sep 15, 2023 at 9:54 AM Micah Kornfield < >> > emkornfi...@gmail.com> >> > > >> wrote: >> > > >> >> > > >>> I'll note there is also a check for encoding effectiveness [1] >> that >> > > could >> > > >>> come into play but I'd guess that isn't the case here. >> > > >>> >> > > >>> [1] >> > > >>> >> > > >>> >> > > >> > >> https://github.com/apache/parquet-mr/blob/9b5a962df3007009a227ef421600197531f970a5/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L124 >> > > >>> >> > > >>> On Fri, Sep 15, 2023 at 9:51 AM Micah Kornfield < >> > emkornfi...@gmail.com >> > > > >> > > >>> wrote: >> > > >>> >> > > >>> > I'm glad I was looking at the right setting for dictionary >> size. I >> > > just >> > > >>> >> tried it out with 10x, 50x, and even total file size, though, >> and >> > > >>> still am >> > > >>> >> not seeing a dictionary get created. Is it possible it's >> bounded >> > by >> > > >>> file >> > > >>> >> page size or some other layout option that I need to bump as >> well? >> > > >>> > >> > > >>> > >> > > >>> > Sorry I'm less familiar with parquet-mr, hopefully someone else >> to >> > > >>> chime >> > > >>> > in. If I had to guess, maybe somehow the config value isn't >> making >> > > it >> > > >>> to >> > > >>> > the writer (but there could also be something else at play). >> > > >>> > >> > > >>> > On Fri, Sep 15, 2023 at 9:33 AM Claire McGinty < >> > > >>> claire.d.mcgi...@gmail.com> >> > > >>> > wrote: >> > > >>> > >> > > >>> >> Thanks so much, Micah! >> > > >>> >> >> > > >>> >> I think you are using the right setting, but maybe it is >> possible >> > > the >> > > >>> >> > strings are still exceeding the threshold (perhaps >> increasing it >> > > by >> > > >>> 50x >> > > >>> >> or >> > > >>> >> > more to verify) >> > > >>> >> >> > > >>> >> >> > > >>> >> I'm glad I was looking at the right setting for dictionary >> size. I >> > > >>> just >> > > >>> >> tried it out with 10x, 50x, and even total file size, though, >> and >> > > >>> still am >> > > >>> >> not seeing a dictionary get created. Is it possible it's >> bounded >> > by >> > > >>> file >> > > >>> >> page size or some other layout option that I need to bump as >> well? >> > > >>> >> >> > > >>> >> I haven't seen my discussion during my time in the community >> but >> > > >>> maybe it >> > > >>> >> > was discussed in the past. I think the main challenge here >> is >> > > that >> > > >>> >> pages >> > > >>> >> > are either dictionary encoded or not. I'd guess to make this >> > > >>> practical >> > > >>> >> > there would need to be a new hybrid page type, which I think >> it >> > > >>> might >> > > >>> >> be an >> > > >>> >> > interesting idea but quite a bit of work. Additionally, one >> > would >> > > >>> >> likely >> > > >>> >> > need heuristics for when to potentially use the new mode >> versus >> > a >> > > >>> >> complete >> > > >>> >> > fallback. >> > > >>> >> > >> > > >>> >> >> > > >>> >> Got it, thanks for the explanation! It does seem like a huge >> > amount >> > > of >> > > >>> >> work >> > > >>> >> >> > > >>> >> >> > > >>> >> Best, >> > > >>> >> Claire >> > > >>> >> >> > > >>> >> >> > > >>> >> >> > > >>> >> On Thu, Sep 14, 2023 at 5:16 PM Micah Kornfield < >> > > >>> emkornfi...@gmail.com> >> > > >>> >> wrote: >> > > >>> >> >> > > >>> >> > > >> > > >>> >> > > - What's the heuristic for Parquet dictionary writing to >> > succeed >> > > >>> for a >> > > >>> >> > > given column? >> > > >>> >> > >> > > >>> >> > >> > > >>> >> > >> > > >>> >> > >> > > >>> >> >> > > >>> >> > > >> > >> https://github.com/apache/parquet-mr/blob/9b5a962df3007009a227ef421600197531f970a5/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L117 >> > > >>> >> > >> > > >>> >> > >> > > >>> >> > > - Is that heuristic configurable at all? >> > > >>> >> > >> > > >>> >> > >> > > >>> >> > I think you are using the right setting, but maybe it is >> > possible >> > > >>> the >> > > >>> >> > strings are still exceeding the threshold (perhaps >> increasing it >> > > by >> > > >>> 50x >> > > >>> >> or >> > > >>> >> > more to verify) >> > > >>> >> > >> > > >>> >> > >> > > >>> >> > > - For high-cardinality datasets, has the idea of a >> > > frequency-based >> > > >>> >> > > dictionary encoding been explored? Say, if the data >> follows a >> > > >>> certain >> > > >>> >> > > statistical distribution, we can create a dictionary of the >> > most >> > > >>> >> frequent >> > > >>> >> > > values only? >> > > >>> >> > >> > > >>> >> > I haven't seen my discussion during my time in the community >> but >> > > >>> maybe >> > > >>> >> it >> > > >>> >> > was discussed in the past. I think the main challenge here >> is >> > > that >> > > >>> >> pages >> > > >>> >> > are either dictionary encoded or not. I'd guess to make this >> > > >>> practical >> > > >>> >> > there would need to be a new hybrid page type, which I think >> it >> > > >>> might >> > > >>> >> be an >> > > >>> >> > interesting idea but quite a bit of work. Additionally, one >> > would >> > > >>> >> likely >> > > >>> >> > need heuristics for when to potentially use the new mode >> versus >> > a >> > > >>> >> complete >> > > >>> >> > fallback. >> > > >>> >> > >> > > >>> >> > Cheers, >> > > >>> >> > Micah >> > > >>> >> > >> > > >>> >> > On Thu, Sep 14, 2023 at 12:07 PM Claire McGinty < >> > > >>> >> > claire.d.mcgi...@gmail.com> >> > > >>> >> > wrote: >> > > >>> >> > >> > > >>> >> > > Hi dev@, >> > > >>> >> > > >> > > >>> >> > > I'm running some benchmarking on Parquet read/write >> > performance >> > > >>> and >> > > >>> >> have >> > > >>> >> > a >> > > >>> >> > > few questions about how dictionary encoding works under the >> > > hood. >> > > >>> Let >> > > >>> >> me >> > > >>> >> > > know if there's a better channel for this :) >> > > >>> >> > > >> > > >>> >> > > My test case uses parquet-avro, where I'm writing a single >> > file >> > > >>> >> > containing >> > > >>> >> > > 5 million records. Each record has a single column, an Avro >> > > String >> > > >>> >> field >> > > >>> >> > > (Parquet binary field). I ran two configurations of base >> > setup: >> > > >>> in the >> > > >>> >> > > first case, the string field has 5,000 possible unique >> values. >> > > In >> > > >>> the >> > > >>> >> > > second case, it has 50,000 unique values. >> > > >>> >> > > >> > > >>> >> > > In the first case (5k unique values), I used parquet-tools >> to >> > > >>> inspect >> > > >>> >> the >> > > >>> >> > > file metadata and found that a dictionary had been written: >> > > >>> >> > > >> > > >>> >> > > % parquet-tools meta testdata-case1.parquet >> > > >>> >> > > > file schema: testdata.TestRecord >> > > >>> >> > > > >> > > >>> >> > > > >> > > >>> >> > > >> > > >>> >> > >> > > >>> >> >> > > >>> >> > > >> > >> -------------------------------------------------------------------------------- >> > > >>> >> > > > stringField: REQUIRED BINARY L:STRING R:0 D:0 >> > > >>> >> > > > row group 1: RC:5000001 TS:18262874 OFFSET:4 >> > > >>> >> > > > >> > > >>> >> > > > >> > > >>> >> > > >> > > >>> >> > >> > > >>> >> >> > > >>> >> > > >> > >> -------------------------------------------------------------------------------- >> > > >>> >> > > > stringField: BINARY UNCOMPRESSED DO:4 FPO:38918 >> > > >>> >> > SZ:8181452/8181452/1.00 >> > > >>> >> > > > VC:5000001 ENC:BIT_PACKED,PLAIN_DICTIONARY ST:[min: 0, >> max: >> > > 999, >> > > >>> >> > > num_nulls: >> > > >>> >> > > > 0] >> > > >>> >> > > >> > > >>> >> > > >> > > >>> >> > > But in the second case (50k unique values), parquet-tools >> > shows >> > > >>> that >> > > >>> >> no >> > > >>> >> > > dictionary gets created, and the file size is *much* >> bigger: >> > > >>> >> > > >> > > >>> >> > > % parquet-tools meta testdata-case2.parquet >> > > >>> >> > > > file schema: testdata.TestRecord >> > > >>> >> > > > >> > > >>> >> > > > >> > > >>> >> > > >> > > >>> >> > >> > > >>> >> >> > > >>> >> > > >> > >> -------------------------------------------------------------------------------- >> > > >>> >> > > > stringField: REQUIRED BINARY L:STRING R:0 D:0 >> > > >>> >> > > > row group 1: RC:5000001 TS:18262874 OFFSET:4 >> > > >>> >> > > > >> > > >>> >> > > > >> > > >>> >> > > >> > > >>> >> > >> > > >>> >> >> > > >>> >> > > >> > >> -------------------------------------------------------------------------------- >> > > >>> >> > > > stringField: BINARY UNCOMPRESSED DO:0 FPO:4 >> > > >>> >> SZ:43896278/43896278/1.00 >> > > >>> >> > > > VC:5000001 ENC:PLAIN,BIT_PACKED ST:[min: 0, max: 9999, >> > > >>> num_nulls: 0] >> > > >>> >> > > >> > > >>> >> > > >> > > >>> >> > > (I created a gist of my test reproduction here >> > > >>> >> > > < >> > > >>> >> >> > > >>> >> > https://gist.github.com/clairemcginty/c3c0be85f51bc23db45a75e8d8a18806 >> > > >>> >> > >.) >> > > >>> >> > > >> > > >>> >> > > Based on this, I'm guessing there's some tip-over point >> after >> > > >>> which >> > > >>> >> > Parquet >> > > >>> >> > > will give up on writing a dictionary for a given column? >> After >> > > >>> reading >> > > >>> >> > > the Configuration >> > > >>> >> > > docs >> > > >>> >> > > < >> > > >>> >> > >> > > >>> >> >> > > >>> >> > > >> > >> https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/README.md >> > > >>> >> > > >, >> > > >>> >> > > I tried increasing the dictionary page size configuration >> 5x, >> > > >>> with the >> > > >>> >> > same >> > > >>> >> > > result (no dictionary created). >> > > >>> >> > > >> > > >>> >> > > So to summarize, my questions are: >> > > >>> >> > > >> > > >>> >> > > - What's the heuristic for Parquet dictionary writing to >> > succeed >> > > >>> for a >> > > >>> >> > > given column? >> > > >>> >> > > - Is that heuristic configurable at all? >> > > >>> >> > > - For high-cardinality datasets, has the idea of a >> > > frequency-based >> > > >>> >> > > dictionary encoding been explored? Say, if the data >> follows a >> > > >>> certain >> > > >>> >> > > statistical distribution, we can create a dictionary of the >> > most >> > > >>> >> frequent >> > > >>> >> > > values only? >> > > >>> >> > > >> > > >>> >> > > Thanks for your time! >> > > >>> >> > > - Claire >> > > >>> >> > > >> > > >>> >> > >> > > >>> >> >> > > >>> > >> > > >>> >> > > >> >> > > >> > >> > >> > -- >> > Aaron Niskode-Dossett, Data Engineering -- Etsy >> > >> >