Hi all,

Just to follow up, I ran some benchmarks with an added Configuration option to set a desired "compression ratio" param, which you can see here <https://github.com/clairemcginty/parquet-benchmarks/blob/main/write_results.md#max-dictionary-compression-ratio-option>, on a variety of data layouts (distribution, sorting, cardinality, etc.). I also have a table of comparisons <https://github.com/clairemcginty/parquet-benchmarks/blob/main/write_results.md#overall-comparison> using the latest 1.14.0-SNAPSHOT as a baseline. These are my takeaways:
- The compression ratio param doesn't benefit data that's in sorted order (IMO, because even without this param, sorted columns are more likely to produce efficient dict encodings).
- On shuffled (non-sorted) data, setting the compression ratio param produces a much smaller *uncompressed* file (in one case, 39MB versus the baseline's 102MB). However, after applying a file-level compression algorithm such as ZSTD, the baseline and ratio-param results turn out pretty much equal (within a 5% margin). I think this makes sense: since in Parquet 1.0 dictionaries are not encoded, ZSTD must be better at compressing many repeated column values across a file than it is at compressing dictionaries across all pages.
- The compression ratio param works best with larger page sizes (10MB or 50MB) and large-ish dictionary page sizes (10MB).

Overall, I think the equalizing behavior of file-level compression (ZSTD) makes it not worth adding a configuration option for dictionary compression :) Thanks for all of your input on this -- if nothing else, the benchmarks are a really interesting look at how important data layout is to overall file size!

Best,
Claire

On Thu, Sep 21, 2023 at 2:12 PM Claire McGinty <[email protected]> wrote:

> Hmm, like a flag to basically turn off the isCompressionSatisfying check per-column? That might be simplest!
>
> So to summarize, a column will not write a dictionary encoding when any of the following holds:
>
> (1) `parquet.enable.dictionary` is set to False
> (2) # of distinct values in a column chunk exceeds `DictionaryValuesWriter#MAX_DICTIONARY_VALUES` (currently set to Integer.MAX_VALUE)
> (3) Total encoded bytes in a dictionary exceed the value of `parquet.dictionary.page.size`
> (4) Desired compression ratio (as a measure of # distinct values : total # values) is not achieved
>
> I might try out various options for making (4) configurable, starting with your suggestion, and testing them out on more realistic data distributions. Will try to return to this thread with my results in a few days :)
>
> Best,
> Claire
>
> On Thu, Sep 21, 2023 at 8:48 AM Gang Wu <[email protected]> wrote:
>
>> The current implementation only checks the first page, which is vulnerable in many cases. I think your suggestion makes sense. However, there is no one-size-fits-all solution. How about simply adding a flag to enforce dictionary encoding for a specific column?
>>
>> On Thu, Sep 21, 2023 at 1:08 AM Claire McGinty <[email protected]> wrote:
>>
>>> I think I figured it out! The dictionaryByteSize == 0 was a red herring; I was looking at an IntegerDictionaryValuesWriter for an empty column rather than my high-cardinality column. Your analysis of the situation was right -- it was just that in the first page, there weren't enough repeated values to pass the check.
>>>
>>> I wonder if we could maybe make this value configurable per-column? Either:
>>>
>>> - A desired ratio of distinct values / total values, on a scale of 0-1.0
>>> - Number of pages to check for compression before falling back
>>>
>>> Let me know what you think!
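
The two per-column options floated above (a distinct/total ratio and a number of pages to check) could look roughly like the sketch below. It is illustrative only: the class `DictionaryFallbackPolicy`, its fields, and its method names are hypothetical and are not part of parquet-mr.

```java
/**
 * Hypothetical sketch (not parquet-mr code) of the two per-column options:
 * a maximum distinct/total ratio, and a number of pages to observe before
 * deciding whether to fall back from dictionary encoding.
 */
final class DictionaryFallbackPolicy {
  private final double maxDistinctRatio; // 0.0-1.0; lower values are stricter
  private final int pagesToCheck;        // pages to observe before deciding

  DictionaryFallbackPolicy(double maxDistinctRatio, int pagesToCheck) {
    if (maxDistinctRatio < 0.0 || maxDistinctRatio > 1.0) {
      throw new IllegalArgumentException("ratio must be within [0.0, 1.0]");
    }
    if (pagesToCheck < 1) {
      throw new IllegalArgumentException("pagesToCheck must be at least 1");
    }
    this.maxDistinctRatio = maxDistinctRatio;
    this.pagesToCheck = pagesToCheck;
  }

  /** True once enough pages have been written for the ratio check to apply. */
  boolean readyToDecide(int pagesWritten) {
    return pagesWritten >= pagesToCheck;
  }

  /** True if the dictionary should be kept, i.e. distinct/total is within the limit. */
  boolean keepDictionary(long distinctValues, long totalValues) {
    if (totalValues == 0) {
      return true; // nothing written yet, so there is nothing to judge
    }
    return (double) distinctValues / totalValues <= maxDistinctRatio;
  }

  public static void main(String[] args) {
    DictionaryFallbackPolicy policy = new DictionaryFallbackPolicy(0.5, 3);
    // After a single page it is too early to decide when pagesToCheck is 3.
    System.out.println(policy.readyToDecide(1));
    // 10,000 distinct values out of 50,000 written (ratio 0.2).
    System.out.println(policy.keepDictionary(10_000, 50_000));
    // 10,000 distinct values out of 10,000 written (ratio 1.0).
    System.out.println(policy.keepDictionary(10_000, 10_000));
  }
}
```

Deferring the decision until several pages have been seen addresses the "bad luck in the first page" case, while the ratio gives a per-column knob; the two could be used together or separately.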
>>>
>>> Best,
>>> Claire
>>>
>>> On Wed, Sep 20, 2023 at 9:37 AM Gang Wu <[email protected]> wrote:
>>>
>>>> I don't understand why you get encodedSize == 1, dictionaryByteSize == 0 and rawSize == 0 in the first page check. It seems that the page does not have any meaningful values. Could you please check how many values are written before the page check?
>>>>
>>>> On Thu, Sep 21, 2023 at 12:12 AM Claire McGinty <[email protected]> wrote:
>>>>
>>>>> Hey Gang,
>>>>>
>>>>> Thanks for the followup! I see what you're saying where it's sometimes just bad luck with what ends up in the first page. The intuition seems like a larger page size should produce a better encoding in this case... I updated my branch <https://github.com/apache/parquet-mr/compare/master...clairemcginty:parquet-mr:dict-size-repro?expand=1> to add a test with a page size/dict page size of 10MB and am seeing the same failure, though.
>>>>>
>>>>> Something seems kind of odd actually -- when I stepped through the test I added w/ debugger, it falls back after invoking isCompressionSatisfying with encodedSize == 1, dictionaryByteSize == 0 and rawSize == 0; 1 + 0 < 0 evaluates to false. (You can also see this in the System.out logs I added, in the branch's GHA run logs.) This doesn't seem right to me -- does isCompressionSatisfying need an extra check to make sure the dictionary isn't empty?
>>>>>
>>>>> Also, thanks, Aaron! I got into this while running some micro-benchmarks on Parquet reads when various dictionary/bloom filter/encoding options are configured. Happy to share out when I'm done.
>>>>>
>>>>> Best,
>>>>> Claire
>>>>>
>>>>> On Tue, Sep 19, 2023 at 9:06 PM Gang Wu <[email protected]> wrote:
>>>>>
>>>>>> Thanks for the investigation!
>>>>>>
>>>>>> I think the check below makes sense for a single page:
>>>>>>
>>>>>>   @Override
>>>>>>   public boolean isCompressionSatisfying(long rawSize, long encodedSize) {
>>>>>>     return (encodedSize + dictionaryByteSize) < rawSize;
>>>>>>   }
>>>>>>
>>>>>> The problem is that the fallback check is only performed on the first page. In the first page, all values in that page may be distinct, so it is unlikely to pass the isCompressionSatisfying check.
>>>>>>
>>>>>> Best,
>>>>>> Gang
>>>>>>
>>>>>> On Wed, Sep 20, 2023 at 5:04 AM Aaron Niskode-Dossett <[email protected]> wrote:
>>>>>>
>>>>>>> Claire, thank you for your research and examples on this topic, I've learned a lot. My hunch is that your change would be a good one, but I'm not an expert (and more to the point, not a committer). I'm looking forward to learning more as this discussion continues.
>>>>>>>
>>>>>>> Thank you again, Aaron
>>>>>>>
>>>>>>> On Tue, Sep 19, 2023 at 2:48 PM Claire McGinty <[email protected]> wrote:
>>>>>>>
>>>>>>>> I created a quick branch <https://github.com/apache/parquet-mr/compare/master...clairemcginty:parquet-mr:dict-size-repro?expand=1> to reproduce what I'm seeing -- the test shows that an Int column with cardinality 100 successfully results in a dict encoding, but an int column with cardinality 10,000 falls back and doesn't create a dict encoding.
>>>>>>>> This seems like a low threshold given the 1MB dictionary page size, so I just wanted to check whether this is expected or not :)
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Claire
>>>>>>>>
>>>>>>>> On Tue, Sep 19, 2023 at 9:35 AM Claire McGinty <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi, just wanted to follow up on this!
>>>>>>>>>
>>>>>>>>> I ran a debugger to find out why my column wasn't ending up with a dictionary encoding and it turns out that even though DictionaryValuesWriter#shouldFallback() <https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L117> always returned false (dictionaryByteSize was always less than my configured page size), DictionaryValuesWriter#isCompressionSatisfying <https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L125> was what was causing Parquet to switch <https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/fallback/FallbackValuesWriter.java#L75> back to the fallback, non-dict writer.
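
To make the first-page behaviour concrete, here is a rough sketch of the arithmetic behind the check for a page of 4-byte ints. The comparison in `isCompressionSatisfying` mirrors the snippet quoted in this thread (with the dictionary size passed in as a parameter); everything else is an assumption made for illustration, including the class name, `firstPageKeepsDictionary`, and the size estimate in `estimateEncodedSize`, which treats every value as a bit-packed dictionary id and ignores run headers.

```java
/** Illustrative arithmetic only; not the parquet-mr implementation. */
final class FirstPageCheckSketch {

  /** Same comparison as the isCompressionSatisfying snippet quoted in this thread. */
  static boolean isCompressionSatisfying(long rawSize, long encodedSize, long dictionaryByteSize) {
    return (encodedSize + dictionaryByteSize) < rawSize;
  }

  /** Assumed estimate: each value is stored as a bit-packed dictionary id. */
  static long estimateEncodedSize(long valueCount, long distinctCount) {
    // Bit width needed to represent the largest id (distinctCount - 1), at least 1.
    long largestId = Math.max(1, distinctCount - 1);
    int bitsPerId = 64 - Long.numberOfLeadingZeros(largestId);
    return (valueCount * bitsPerId + 7) / 8; // round up to whole bytes
  }

  /** Runs the check for a first page of int values with the given cardinality. */
  static boolean firstPageKeepsDictionary(long valuesInPage, long distinctInPage) {
    long rawSize = valuesInPage * Integer.BYTES;
    long dictionaryByteSize = distinctInPage * Integer.BYTES;
    long encodedSize = estimateEncodedSize(valuesInPage, distinctInPage);
    return isCompressionSatisfying(rawSize, encodedSize, dictionaryByteSize);
  }

  public static void main(String[] args) {
    // 10,000 values drawn from 100 distinct ints: many repeats within the page.
    System.out.println(firstPageKeepsDictionary(10_000, 100));
    // 10,000 values that are all distinct: the dictionary alone is as large as the raw data.
    System.out.println(firstPageKeepsDictionary(10_000, 10_000));
  }
}
```

Under these assumptions, a page whose values are all distinct can never pass, because the dictionary bytes already equal the raw bytes before any ids are counted, regardless of how small the dictionary is relative to the configured dictionary page size.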
>>>>>>>>>
>>>>>>>>> From what I can tell, this check compares the total byte size of *every* element with the byte size of each *distinct* element as a kind of proxy for encoding efficiency... however, it seems strange that this check can cause the writer to fall back even if the total encoded dict size is far below the configured dictionary page size. Out of curiosity, I modified DictionaryValuesWriter#isCompressionSatisfying <https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L125> to also check whether total byte size was less than dictionary max size and re-ran my Parquet write with a local snapshot, and my file size dropped 50%.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Claire
>>>>>>>>>
>>>>>>>>> On Mon, Sep 18, 2023 at 9:16 AM Claire McGinty <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Oh, interesting! I'm setting it via the ParquetWriter#withDictionaryPageSize method, and I do see the overall file size increasing when I bump the value. I'll look into it a bit more -- it would be helpful for some cases where the # unique values in a column is just over the size limit.
>>>>>>>>>>
>>>>>>>>>> - Claire
>>>>>>>>>>
>>>>>>>>>> On Fri, Sep 15, 2023 at 9:54 AM Micah Kornfield <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> I'll note there is also a check for encoding effectiveness [1] that could come into play but I'd guess that isn't the case here.
>>>>>>>>>>>
>>>>>>>>>>> [1] https://github.com/apache/parquet-mr/blob/9b5a962df3007009a227ef421600197531f970a5/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L124
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Sep 15, 2023 at 9:51 AM Micah Kornfield <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>>> I'm glad I was looking at the right setting for dictionary size. I just tried it out with 10x, 50x, and even total file size, though, and still am not seeing a dictionary get created. Is it possible it's bounded by file page size or some other layout option that I need to bump as well?
>>>>>>>>>>>>
>>>>>>>>>>>> Sorry, I'm less familiar with parquet-mr; hopefully someone else can chime in. If I had to guess, maybe somehow the config value isn't making it to the writer (but there could also be something else at play).
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Sep 15, 2023 at 9:33 AM Claire McGinty <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks so much, Micah!
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I think you are using the right setting, but maybe it is possible the strings are still exceeding the threshold (perhaps increasing it by 50x or more to verify)
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm glad I was looking at the right setting for dictionary size. I just tried it out with 10x, 50x, and even total file size, though, and still am not seeing a dictionary get created. Is it possible it's bounded by file page size or some other layout option that I need to bump as well?
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I haven't seen any discussion during my time in the community but maybe it was discussed in the past. I think the main challenge here is that pages are either dictionary encoded or not. I'd guess to make this practical there would need to be a new hybrid page type, which I think might be an interesting idea but quite a bit of work.
>>>>>>>>>>>>>> Additionally, one would likely need heuristics for when to potentially use the new mode versus a complete fallback.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Got it, thanks for the explanation! It does seem like a huge amount of work.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>> Claire
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Sep 14, 2023 at 5:16 PM Micah Kornfield <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - What's the heuristic for Parquet dictionary writing to succeed for a given column?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> https://github.com/apache/parquet-mr/blob/9b5a962df3007009a227ef421600197531f970a5/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L117
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - Is that heuristic configurable at all?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I think you are using the right setting, but maybe it is possible the strings are still exceeding the threshold (perhaps increasing it by 50x or more to verify)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - For high-cardinality datasets, has the idea of a frequency-based dictionary encoding been explored? Say, if the data follows a certain statistical distribution, we can create a dictionary of the most frequent values only?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I haven't seen any discussion during my time in the community but maybe it was discussed in the past. I think the main challenge here is that pages are either dictionary encoded or not. I'd guess to make this practical there would need to be a new hybrid page type, which I think might be an interesting idea but quite a bit of work. Additionally, one would likely need heuristics for when to potentially use the new mode versus a complete fallback.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>> Micah
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Sep 14, 2023 at 12:07 PM Claire McGinty <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi dev@,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm running some benchmarking on Parquet read/write performance and have a few questions about how dictionary encoding works under the hood. Let me know if there's a better channel for this :)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> My test case uses parquet-avro, where I'm writing a single file containing 5 million records. Each record has a single column, an Avro String field (Parquet binary field). I ran two configurations of the base setup: in the first case, the string field has 5,000 possible unique values. In the second case, it has 50,000 unique values.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> In the first case (5k unique values), I used parquet-tools to inspect the file metadata and found that a dictionary had been written:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> % parquet-tools meta testdata-case1.parquet
>>>>>>>>>>>>>>>> file schema: testdata.TestRecord
>>>>>>>>>>>>>>>> --------------------------------------------------------------------------------
>>>>>>>>>>>>>>>> stringField: REQUIRED BINARY L:STRING R:0 D:0
>>>>>>>>>>>>>>>> row group 1: RC:5000001 TS:18262874 OFFSET:4
>>>>>>>>>>>>>>>> --------------------------------------------------------------------------------
>>>>>>>>>>>>>>>> stringField: BINARY UNCOMPRESSED DO:4 FPO:38918 SZ:8181452/8181452/1.00 VC:5000001 ENC:BIT_PACKED,PLAIN_DICTIONARY ST:[min: 0, max: 999, num_nulls: 0]
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> But in the second case (50k unique values), parquet-tools shows that no dictionary gets created, and the file size is *much* bigger:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> % parquet-tools meta testdata-case2.parquet
>>>>>>>>>>>>>>>> file schema: testdata.TestRecord
>>>>>>>>>>>>>>>> --------------------------------------------------------------------------------
>>>>>>>>>>>>>>>> stringField: REQUIRED BINARY L:STRING R:0 D:0
>>>>>>>>>>>>>>>> row group 1: RC:5000001 TS:18262874 OFFSET:4
>>>>>>>>>>>>>>>> --------------------------------------------------------------------------------
>>>>>>>>>>>>>>>> stringField: BINARY UNCOMPRESSED DO:0 FPO:4 SZ:43896278/43896278/1.00 VC:5000001 ENC:PLAIN,BIT_PACKED ST:[min: 0, max: 9999, num_nulls: 0]
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> (I created a gist of my test reproduction here <https://gist.github.com/clairemcginty/c3c0be85f51bc23db45a75e8d8a18806>.)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Based on this, I'm guessing there's some tip-over point after which Parquet will give up on writing a dictionary for a given column?
>>>>>>>>>>>>>>> After reading the Configuration docs <https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/README.md>, I tried increasing the dictionary page size configuration 5x, with the same result (no dictionary created).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> So to summarize, my questions are:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - What's the heuristic for Parquet dictionary writing to succeed for a given column?
>>>>>>>>>>>>>>> - Is that heuristic configurable at all?
>>>>>>>>>>>>>>> - For high-cardinality datasets, has the idea of a frequency-based dictionary encoding been explored? Say, if the data follows a certain statistical distribution, we can create a dictionary of the most frequent values only?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks for your time!
>>>>>>>>>>>>>>> - Claire
>>>>>>>
>>>>>>> --
>>>>>>> Aaron Niskode-Dossett, Data Engineering -- Etsy
