Hi Manik,

You could store "raw" as a LIST<BYTE_ARRAY> instead of a single BYTE_ARRAY, which means tokenizing in your ETL step. Since the same tokens recur across many log lines, you would then reap the benefits of dictionary encoding.
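To make the idea concrete, here is a minimal pure-Python sketch of why tokenizing helps. It is not Parquet's actual encoder. The helpers `tokenize`, `dictionary_encode`, and `may_contain`, as well as the sample log lines, are hypothetical names and data made up for illustration, and whitespace splitting is just an assumed tokenization rule.

```python
def tokenize(line):
    """Hypothetical ETL step: split a log line into whitespace-separated tokens."""
    return line.split()


def dictionary_encode(values):
    """Toy dictionary encoding: store each distinct value once, plus one index per value."""
    dictionary = []   # distinct values in first-seen order
    positions = {}    # value -> index into `dictionary`
    indices = []      # one dictionary index per input value
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def may_contain(term, dictionary):
    """If a term is absent from the dictionary, no row in this chunk can contain it."""
    return term in dictionary


# Made-up sample data standing in for the "raw" column.
raw_lines = [
    "parquet is a columnar format",
    "orc is a columnar format",
    "parquet is a file format",
]

# Whole lines: every line is distinct, so the dictionary is as large as the data.
line_dict, line_idx = dictionary_encode(raw_lines)

# Tokenized rows (the LIST<BYTE_ARRAY> shape), flattened into one stream of values.
rows = [tokenize(line) for line in raw_lines]
flat_tokens = [token for row in rows for token in row]
token_dict, token_idx = dictionary_encode(flat_tokens)

print(len(raw_lines), "lines ->", len(line_dict), "dictionary entries")
print(len(flat_tokens), "tokens ->", len(token_dict), "dictionary entries")
print("chunk may contain 'orc':", may_contain("orc", token_dict))
print("chunk may contain 'avro':", may_contain("avro", token_dict))
```

With pyarrow, a column of this shape could be declared as `pa.list_(pa.binary())`. Whether a given reader actually uses the dictionary to skip row groups depends on the implementation, so treat the filtering part as an assumption to verify.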
- Wes

On Wed, Jun 12, 2019 at 12:08 PM Manik Singla <smanik...@gmail.com> wrote:
>
> Could someone guide me on this one?
>
> Regards
> Manik Singla
> +91-9996008893
> +91-9665639677
>
> "Life doesn't consist in holding good cards but playing those you hold
> well."
>
>
> On Tue, Jun 11, 2019 at 5:58 PM Manik Singla <smanik...@gmail.com> wrote:
>
> > Hey Team
> >
> > I have started using parquet recently.
> >
> > The kind of data I save is something like
> >
> > *raw hostname cluster serviceName*
> >
> > where raw is the actual log lines.
> >
> > For raw, dictionary encoding doesn't work, as no 2 log lines are the same.
> > But if we tokenise the terms in the dictionary, then the dictionary can help
> > here to filter out unwanted rows. For example, "parquet is a columnar format"
> > will become "parquet", "is", "a", "columnar", "format".
> >
> > Also, I see mention of merging bloomfilter, but I am not sure if we are
> > considering tokenisation there.
> >
> > Do we support some out-of-the-box way to tokenise text before dictionary
> > encoding?
> >
> > Also, what are your views if we think of adding it?
> >
> > Regards
> > Manik Singla
> > +91-9996008893
> > +91-9665639677
> >
> > "Life doesn't consist in holding good cards but playing those you hold
> > well."
> >