I agree, thanks for looking into this.

> Overall, I think the equalizing behavior of file-level compression (ZSTD)
> makes it not worth it to add a configuration option for dictionary
> compression :)
One reason to potentially still move forward with configuration here is that general-purpose compression can be significantly slower than dictionary compression. If you've already chosen to compress, then I agree a new knob might not be worth it. At least some recent research [1] points to this being a bottleneck.

[1] https://www.cs.cit.tum.de/fileadmin/w00cfj/dis/papers/btrblocks.pdf

On Wed, Sep 27, 2023 at 9:58 PM Gang Wu <ust...@gmail.com> wrote:

Thanks for the thorough benchmark. These findings are pretty interesting!

On Thu, Sep 28, 2023 at 5:32 AM Claire McGinty <claire.d.mcgi...@gmail.com> wrote:

Hi all,

Just to follow up, I ran some benchmarks with an added Configuration option to set a desired "compression ratio" param, which you can see here <https://github.com/clairemcginty/parquet-benchmarks/blob/main/write_results.md#max-dictionary-compression-ratio-option>, on a variety of data layouts (distribution, sorting, cardinality, etc.). I also have a table of comparisons <https://github.com/clairemcginty/parquet-benchmarks/blob/main/write_results.md#overall-comparison> using the latest 0.14.0-SNAPSHOT as a baseline. These are my takeaways:

- The compression ratio param doesn't benefit data that's in sorted order (IMO, because even without the addition of this param, sorted columns are more likely to produce efficient dict encodings).
- On shuffled (non-sorted) data, setting the compression ratio param produces a much better *uncompressed* file result (in one case, 39MB vs. the baseline 102MB). However, after applying a file-level compression algorithm such as ZSTD, the baseline and ratio-param results turn out pretty much equal (within a 5% margin).
  I think this makes sense, since in Parquet 1.0 dictionaries are not encoded, so ZSTD must be better at compressing many repeated column values across a file than it is at compressing dictionaries across all pages.
- The compression ratio param works best with larger page sizes (10MB or 50MB) combined with large-ish dictionary page sizes (10MB).

Overall, I think the equalizing behavior of file-level compression (ZSTD) makes it not worth it to add a configuration option for dictionary compression :) Thanks for all of your input on this -- if nothing else, the benchmarks are a really interesting look at how important data layout is to overall file size!

Best,
Claire

On Thu, Sep 21, 2023 at 2:12 PM Claire McGinty <claire.d.mcgi...@gmail.com> wrote:

Hmm, like a flag to basically turn off the isCompressionSatisfying check per-column? That might be simplest!

So to summarize, a column will not write a dictionary encoding when any of the following holds:

(1) `parquet.enable.dictionary` is set to false
(2) The # of distinct values in a column chunk exceeds `DictionaryValuesWriter#MAX_DICTIONARY_VALUES` (currently set to Integer.MAX_VALUE)
(3) The total encoded bytes in a dictionary exceed the value of `parquet.dictionary.page.size`
(4) The desired compression ratio (as a measure of # distinct values : total # values) is not achieved

I might try out various options for making (4) configurable, starting with your suggestion, and testing them out on more realistic data distributions. Will try to return to this thread with my results in a few days :)

Best,
Claire

On Thu, Sep 21, 2023 at 8:48 AM Gang Wu <ust...@gmail.com> wrote:

The current implementation only checks the first page, which is vulnerable in many cases. I think your suggestion makes sense.
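[Editor's note: the four fallback conditions Claire enumerates above can be collapsed into a single predicate. This is only an illustrative sketch -- the names and the flat structure below are hypothetical, not parquet-mr's actual control flow, which spreads these checks across DictionaryValuesWriter and FallbackValuesWriter.]

```java
// Illustrative only: restates conditions (1)-(4) from the summary above as
// one boolean. Condition (4) uses the predicate quoted later in the thread.
public class DictFallbackSketch {
    static boolean willWriteDictionary(
            boolean dictionaryEnabled,   // (1) parquet.enable.dictionary
            long distinctValues,         // (2) vs. MAX_DICTIONARY_VALUES
            long maxDictionaryValues,
            long dictionaryByteSize,     // (3) vs. parquet.dictionary.page.size
            long dictionaryPageSize,
            long rawSize,                // (4) the compression-ratio check
            long encodedSize) {
        return dictionaryEnabled
                && distinctValues <= maxDictionaryValues
                && dictionaryByteSize <= dictionaryPageSize
                && (encodedSize + dictionaryByteSize) < rawSize;
    }

    public static void main(String[] args) {
        // A column chunk whose dictionary easily fits a 1MB dictionary page
        // size but still fails check (4), because the dictionary plus the
        // encoded ids is no smaller than the raw data:
        System.out.println(willWriteDictionary(
                true, 10_000, Integer.MAX_VALUE,  // (1), (2)
                40_000, 1_048_576,                // (3): 40KB dict vs. 1MB limit
                40_000, 10_000));                 // (4): (10000 + 40000) < 40000 is false
    }
}
```

This is the situation Claire describes: (3) passes comfortably, yet (4) alone forces the fallback.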
However, there is no one-size-fits-all solution. How about simply adding a flag to enforce dictionary encoding for a specific column?

On Thu, Sep 21, 2023 at 1:08 AM Claire McGinty <claire.d.mcgi...@gmail.com> wrote:

I think I figured it out! The dictionaryByteSize == 0 was a red herring; I was looking at an IntegerDictionaryValuesWriter for an empty column rather than my high-cardinality column. Your analysis of the situation was right -- it was just that in the first page, there weren't enough distinct values to pass the check.

I wonder if we could maybe make this value configurable per-column? Either:

- A desired ratio of distinct values / total values, on a scale of 0-1.0
- The number of pages to check for compression before falling back

Let me know what you think!

Best,
Claire

On Wed, Sep 20, 2023 at 9:37 AM Gang Wu <ust...@gmail.com> wrote:

I don't understand why you get encodedSize == 1, dictionaryByteSize == 0, and rawSize == 0 in the first page check. It seems that the page does not have any meaningful values. Could you please check how many values are written before the page check?

On Thu, Sep 21, 2023 at 12:12 AM Claire McGinty <claire.d.mcgi...@gmail.com> wrote:

Hey Gang,

Thanks for the followup! I see what you're saying -- it's sometimes just bad luck with what ends up in the first page. Intuitively, it seems like a larger page size should produce a better encoding in this case...
I updated my branch <https://github.com/apache/parquet-mr/compare/master...clairemcginty:parquet-mr:dict-size-repro?expand=1> to add a test with a page size/dict page size of 10MB and am seeing the same failure, though.

Something seems kind of odd, actually -- when I stepped through the test I added with a debugger, it falls back after invoking isCompressionSatisfying with encodedSize == 1, dictionaryByteSize == 0, and rawSize == 0; (1 + 0) < 0 returns false. (You can also see this in the System.out logs I added, in the branch's GHA run logs.) This doesn't seem right to me -- does isCompressionSatisfying need an extra check to make sure the dictionary isn't empty?

Also, thanks, Aaron! I got into this while running some micro-benchmarks on Parquet reads when various dictionary/bloom filter/encoding options are configured. Happy to share out when I'm done.

Best,
Claire

On Tue, Sep 19, 2023 at 9:06 PM Gang Wu <ust...@gmail.com> wrote:

Thanks for the investigation!

I think the check below makes sense for a single page:

    @Override
    public boolean isCompressionSatisfying(long rawSize, long encodedSize) {
      return (encodedSize + dictionaryByteSize) < rawSize;
    }

The problem is that the fallback check is only performed on the first page.
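[Editor's note: a self-contained simulation of the first-page problem. The byte accounting is deliberately simplified -- string length as raw size, an assumed fixed 4 bytes per dictionary id -- and only the final predicate mirrors the isCompressionSatisfying check quoted above.]

```java
// Simulation, NOT parquet-mr code: shows how a first page of all-distinct
// values fails (encodedSize + dictionaryByteSize) < rawSize, while a page
// with heavy repetition would have passed.
import java.util.*;

public class FirstPageCheck {
    static boolean isCompressionSatisfying(long rawSize, long encodedSize, long dictionaryByteSize) {
        return (encodedSize + dictionaryByteSize) < rawSize;
    }

    static boolean checkPage(List<String> page) {
        long rawSize = 0, encodedSize = 0, dictionaryByteSize = 0;
        Set<String> dictionary = new HashSet<>();
        for (String value : page) {
            rawSize += value.length();   // bytes if written PLAIN
            encodedSize += 4;            // assumed fixed-width dictionary id
            if (dictionary.add(value)) {
                dictionaryByteSize += value.length();
            }
        }
        return isCompressionSatisfying(rawSize, encodedSize, dictionaryByteSize);
    }

    public static void main(String[] args) {
        // First page: every value distinct -> the dictionary is pure overhead,
        // so the check fails and the writer falls back for good.
        List<String> firstPage = new ArrayList<>();
        for (int i = 0; i < 1000; i++) firstPage.add("value-" + i);
        System.out.println(checkPage(firstPage));   // false

        // A later page may repeat heavily and would have passed -- but the
        // fallback decision was already made on page one.
        List<String> laterPage = new ArrayList<>();
        for (int i = 0; i < 1000; i++) laterPage.add("value-" + (i % 10));
        System.out.println(checkPage(laterPage));   // true
    }
}
```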
In the first page, all values in that page may be distinct, so it is unlikely to pass the isCompressionSatisfying check.

Best,
Gang

On Wed, Sep 20, 2023 at 5:04 AM Aaron Niskode-Dossett <aniskodedoss...@etsy.com.invalid> wrote:

Claire, thank you for your research and examples on this topic; I've learned a lot. My hunch is that your change would be a good one, but I'm not an expert (and, more to the point, not a committer). I'm looking forward to learning more as this discussion continues.

Thank you again, Aaron

On Tue, Sep 19, 2023 at 2:48 PM Claire McGinty <claire.d.mcgi...@gmail.com> wrote:

I created a quick branch <https://github.com/apache/parquet-mr/compare/master...clairemcginty:parquet-mr:dict-size-repro?expand=1> to reproduce what I'm seeing -- the test shows that an int column with cardinality 100 successfully results in a dict encoding, but an int column with cardinality 10,000 falls back and doesn't create a dict encoding.
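[Editor's note: Claire's proposal earlier in the thread of a configurable distinct-to-total ratio would distinguish exactly these two cases. A hypothetical sketch -- the names, page-size, and threshold values below are made up, not a parquet-mr API:]

```java
// Hypothetical sketch of a configurable "max dictionary compression ratio":
// commit to a dictionary only if (distinct values / total values) observed
// in the first page stays under the configured ceiling.
public class RatioKnobSketch {
    static boolean passesRatioCheck(long distinctValues, long totalValues, double maxRatio) {
        return totalValues > 0 && (double) distinctValues / totalValues <= maxRatio;
    }

    public static void main(String[] args) {
        long firstPageValues = 20_000;  // assumed number of values in the first page
        double maxRatio = 0.1;          // assumed configured ceiling

        // Cardinality 100: ratio 0.005 -> keep the dictionary
        System.out.println(passesRatioCheck(100, firstPageValues, maxRatio));    // true
        // Cardinality 10,000: ratio 0.5 -> fall back
        System.out.println(passesRatioCheck(10_000, firstPageValues, maxRatio)); // false
    }
}
```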
This seems like a low threshold given the 1MB dictionary page size, so I just wanted to check whether this is expected or not :)

Best,
Claire

On Tue, Sep 19, 2023 at 9:35 AM Claire McGinty <claire.d.mcgi...@gmail.com> wrote:

Hi, just wanted to follow up on this!

I ran a debugger to find out why my column wasn't ending up with a dictionary encoding, and it turns out that even though DictionaryValuesWriter#shouldFallBack() <https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L117> always returned false (dictionaryByteSize was always less than my configured page size), DictionaryValuesWriter#isCompressionSatisfying <https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L125> was what was causing Parquet to switch <https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/fallback/FallbackValuesWriter.java#L75> back to the fallback, non-dict writer.

From what I can tell, this check compares the total byte size of *every* element with the byte size of each *distinct* element as a kind of proxy for encoding efficiency... however, it seems strange that this check can cause the writer to fall back even when the total encoded dict size is far below the configured dictionary page size. Out of curiosity, I modified DictionaryValuesWriter#isCompressionSatisfying <https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L125> to also check whether the total byte size was less than the dictionary max size, re-ran my Parquet write with a local snapshot, and my file size dropped 50%.

Best,
Claire

On Mon, Sep 18, 2023 at 9:16 AM Claire McGinty <claire.d.mcgi...@gmail.com> wrote:

Oh, interesting! I'm setting it via the ParquetWriter#withDictionaryPageSize method, and I do see the overall file size increasing when I bump the value.
I'll look into it a bit more -- it would be helpful for some cases where the # of unique values in a column is just over the size limit.

- Claire

On Fri, Sep 15, 2023 at 9:54 AM Micah Kornfield <emkornfi...@gmail.com> wrote:

I'll note there is also a check for encoding effectiveness [1] that could come into play, but I'd guess that isn't the case here.

[1] https://github.com/apache/parquet-mr/blob/9b5a962df3007009a227ef421600197531f970a5/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L124

On Fri, Sep 15, 2023 at 9:51 AM Micah Kornfield <emkornfi...@gmail.com> wrote:

> I'm glad I was looking at the right setting for dictionary size. I just
> tried it out with 10x, 50x, and even total file size, though, and still am
> not seeing a dictionary get created. Is it possible it's bounded by file
> page size or some other layout option that I need to bump as well?
Sorry, I'm less familiar with parquet-mr; hopefully someone else can chime in. If I had to guess, maybe somehow the config value isn't making it to the writer (but there could also be something else at play).

On Fri, Sep 15, 2023 at 9:33 AM Claire McGinty <claire.d.mcgi...@gmail.com> wrote:

Thanks so much, Micah!

> I think you are using the right setting, but maybe it is possible the
> strings are still exceeding the threshold (perhaps increasing it by 50x or
> more to verify)

I'm glad I was looking at the right setting for dictionary size. I just tried it out with 10x, 50x, and even total file size, though, and still am not seeing a dictionary get created. Is it possible it's bounded by file page size or some other layout option that I need to bump as well?

> I haven't seen this discussed during my time in the community, but maybe it
> was discussed in the past. I think the main challenge here is that pages
> are either dictionary encoded or not. I'd guess that to make this practical
> there would need to be a new hybrid page type, which I think might be an
> interesting idea but quite a bit of work. Additionally, one would likely
> need heuristics for when to potentially use the new mode versus a complete
> fallback.

Got it, thanks for the explanation! It does seem like a huge amount of work.

Best,
Claire

On Thu, Sep 14, 2023 at 5:16 PM Micah Kornfield <emkornfi...@gmail.com> wrote:

> - What's the heuristic for Parquet dictionary writing to succeed for a
>   given column?
>   https://github.com/apache/parquet-mr/blob/9b5a962df3007009a227ef421600197531f970a5/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L117
> - Is that heuristic configurable at all?

I think you are using the right setting, but maybe it is possible the strings are still exceeding the threshold (perhaps increasing it by 50x or more to verify).

> - For high-cardinality datasets, has the idea of a frequency-based
>   dictionary encoding been explored? Say, if the data follows a certain
>   statistical distribution, we can create a dictionary of the most
>   frequent values only?

I haven't seen this discussed during my time in the community, but maybe it was discussed in the past.
I think the main challenge here is that pages are either dictionary encoded or not. I'd guess that to make this practical there would need to be a new hybrid page type, which I think might be an interesting idea but quite a bit of work. Additionally, one would likely need heuristics for when to potentially use the new mode versus a complete fallback.

Cheers,
Micah

On Thu, Sep 14, 2023 at 12:07 PM Claire McGinty <claire.d.mcgi...@gmail.com> wrote:

Hi dev@,

I'm running some benchmarking on Parquet read/write performance and have a few questions about how dictionary encoding works under the hood. Let me know if there's a better channel for this :)

My test case uses parquet-avro, where I'm writing a single file containing 5 million records.
Each record has a single column: an Avro String field (a Parquet binary field). I ran two configurations of the base setup: in the first case, the string field has 5,000 possible unique values; in the second case, it has 50,000 unique values.

In the first case (5k unique values), I used parquet-tools to inspect the file metadata and found that a dictionary had been written:

% parquet-tools meta testdata-case1.parquet
file schema: testdata.TestRecord
--------------------------------------------------------------------------------
stringField: REQUIRED BINARY L:STRING R:0 D:0

row group 1: RC:5000001 TS:18262874 OFFSET:4
--------------------------------------------------------------------------------
stringField: BINARY UNCOMPRESSED DO:4 FPO:38918 SZ:8181452/8181452/1.00
             VC:5000001 ENC:BIT_PACKED,PLAIN_DICTIONARY ST:[min: 0, max: 999, num_nulls: 0]

But in the second case (50k unique values), parquet-tools shows that no dictionary gets created, and the file size is *much* bigger:

% parquet-tools meta testdata-case2.parquet
file schema: testdata.TestRecord
--------------------------------------------------------------------------------
stringField: REQUIRED BINARY L:STRING R:0 D:0

row group 1: RC:5000001 TS:18262874 OFFSET:4
--------------------------------------------------------------------------------
stringField: BINARY UNCOMPRESSED DO:0 FPO:4 SZ:43896278/43896278/1.00
             VC:5000001 ENC:PLAIN,BIT_PACKED ST:[min: 0, max: 9999, num_nulls: 0]

(I created a gist of my test reproduction here <https://gist.github.com/clairemcginty/c3c0be85f51bc23db45a75e8d8a18806>.)

Based on this, I'm guessing there's some tip-over point after which Parquet will give up on writing a dictionary for a given column? After reading the Configuration docs <https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/README.md>, I tried increasing the dictionary page size configuration 5x, with the same result (no dictionary created).

So to summarize, my questions are:

- What's the heuristic for Parquet dictionary writing to succeed for a given column?
- Is that heuristic configurable at all?
- For high-cardinality datasets, has the idea of a frequency-based dictionary encoding been explored?
  Say, if the data follows a certain statistical distribution, we can create a dictionary of the most frequent values only?

Thanks for your time!
- Claire

--
Aaron Niskode-Dossett, Data Engineering -- Etsy