Re: bloomfilter and tokenisation

Wes McKinney Wed, 12 Jun 2019 10:25:51 -0700

Hi Manik,

You could store "raw" as a LIST<BYTE_ARRAY> instead of a BYTE_ARRAY
(so you would have to tokenize in your ETL step), and you would then
reap the benefits of dictionary encoding.
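
As a rough illustration of the idea (not Parquet's actual encoder), here
is a minimal pure-Python sketch of tokenizing log lines in an ETL step and
dictionary-encoding the resulting token lists. The function names
(tokenize, dictionary_encode, may_contain) and the whitespace tokenizer
are assumptions made up for this example; in practice the tokenized
column would be written by a Parquet writer as a list-of-strings column.

```python
from typing import Dict, List, Tuple


def tokenize(line: str) -> List[str]:
    # Hypothetical tokenizer: lowercase the line and split on whitespace.
    return line.lower().split()


def dictionary_encode(rows: List[List[str]]) -> Tuple[List[str], List[List[int]]]:
    # Build a dictionary of distinct tokens and replace each token
    # with its index in that dictionary.
    dictionary: List[str] = []
    index: Dict[str, int] = {}
    encoded: List[List[int]] = []
    for tokens in rows:
        ids = []
        for tok in tokens:
            if tok not in index:
                index[tok] = len(dictionary)
                dictionary.append(tok)
            ids.append(index[tok])
        encoded.append(ids)
    return dictionary, encoded


def may_contain(dictionary: List[str], term: str) -> bool:
    # If a term is absent from the dictionary, no row in this chunk
    # can contain it, so the whole chunk could be skipped.
    return term.lower() in dictionary


raw = [
    "parquet is a columnar format",
    "orc is a columnar format too",
]
rows = [tokenize(line) for line in raw]
dictionary, encoded = dictionary_encode(rows)

print(dictionary)
print(encoded)
print(may_contain(dictionary, "avro"))
```

Whole log lines almost never repeat, but individual tokens do, so the
dictionary stays small relative to the data and can double as a cheap
"is this term present at all" check.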


- Wes

On Wed, Jun 12, 2019 at 12:08 PM Manik Singla <smanik...@gmail.com> wrote:
>
> Could someone guide me on this one?
>
> Regards
> Manik Singla
> +91-9996008893
> +91-9665639677
>
> "Life doesn't consist in holding good cards but playing those you hold
> well."
>
>
> On Tue, Jun 11, 2019 at 5:58 PM Manik Singla <smanik...@gmail.com> wrote:
>
> > Hey Team
> >
> > I have started using parquet recently.
> >
> > The kind of data I save looks something like
> >
> > *raw   hostname cluster serviceName  *
> >
> > where raw holds the actual log lines.
> >
> > For raw, dictionary encoding doesn't work, as no two log lines are the
> > same. But if we tokenise the text into terms before dictionary encoding,
> > then the dictionary can help filter out unwanted rows. For example,
> > "parquet is a columnar format" would become "parquet", "is", "a",
> > "columnar", "format".
> >
> > Also, I see mention of merging the bloom filter work, but I am not sure
> > whether tokenisation is being considered there.
> >
> > Do we support an out-of-the-box way to tokenise text before dictionary
> > encoding?
> >
> > Also, what are your views on adding it?
> >
> > Regards
> > Manik Singla
> > +91-9996008893
> > +91-9665639677
> >
> > "Life doesn't consist in holding good cards but playing those you hold
> > well."
> >
