You probably need to be more specific about which language bindings you are
using.  I think the C++ community is just starting to work on being able to
write out bloom filters, so writing them isn't supported yet in C++ or in
the bindings built on it (Python, R, Ruby, etc.).

The way I read the specification, yes: each single value should be added to
the bloom filter independently (though this could also be a gray area where
repeated fields were not considered).
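
To make the tokenization idea concrete, here is a rough sketch in Python. It
is only an illustration: it uses a plain textbook bloom filter, not the
split-block layout from the Parquet specification, and it does not call any
Arrow or Parquet API. The names ToyBloomFilter, tokenize and optimal_num_bits
are made up for this example, and the whitespace tokenizer is an assumption.

```python
import hashlib
import math


def tokenize(text: str) -> list[str]:
    # Naive tokenizer (an assumption): lowercase, then split on whitespace.
    return text.lower().split()


def optimal_num_bits(num_values: int, false_positive_rate: float) -> int:
    # Standard sizing formula for a bloom filter: m = -n * ln(p) / (ln 2)^2.
    return math.ceil(-num_values * math.log(false_positive_rate) / math.log(2) ** 2)


class ToyBloomFilter:
    """Textbook bloom filter; NOT Parquet's split-block bloom filter."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, value: str):
        # Double hashing: derive all bit positions from two halves of one digest.
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(value))


titles = ["The title of my movie", "Another movie title", "The best movie title"]

# Each token is added independently, like each element of a repeated field.
tokens = {token for title in titles for token in tokenize(title)}

num_bits = optimal_num_bits(len(tokens), 0.01)
num_hashes = max(1, round(num_bits / len(tokens) * math.log(2)))
bloom = ToyBloomFilter(num_bits, num_hashes)
for token in tokens:
    bloom.add(token)

# A reader would tokenize the search literal the same way before probing.
print(bloom.might_contain("best"))
```

A reader could skip a row group whenever might_contain returns False for the
search word. The filter size grows with the number of distinct tokens rather
than distinct titles, which is the explosion Gang is worried about.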



On Wed, Jun 7, 2023 at 12:38 PM Marco Colli <[email protected]> wrote:

> @Micah Does that mean that columns of type array already get a bloom filter
> on each single value?
> I am using Apache Arrow in particular to deal with Parquet files.
>
> On Wed, Jun 7, 2023, 16:00 Micah Kornfield <[email protected]> wrote:
>
> > Hi Marco,
> > Could you describe how your proposal differs from tokenizing the target
> > string and storing the list of tokens in a column that has a bloom filter
> > attached?  I think this should be supportable today by the format, at
> > least, if not by existing libraries.
> >
> > Thanks,
> > Micah
> >
> > On Wednesday, June 7, 2023, Gang Wu <[email protected]> wrote:
> >
> > > Hi Marco,
> > >
> > > That sounds interesting!
> > >
> > > However, this requires the Parquet implementation to be able to
> > > tokenize both the strings to write and the literals in the filters.
> > > The actual efficiency depends on the data distribution. I am also
> > > concerned about the possible explosion of distinct values introduced
> > > by splitting words, which may result in a large bloom filter.
> > >
> > > Have you tried any PoC to get a rough estimate of benefits in your use
> > > case?
> > >
> > > Best,
> > > Gang
> > >
> > >
> > >
> > > On Tue, Jun 6, 2023 at 5:06 PM Marco Colli <[email protected]> wrote:
> > >
> > > > Hello,
> > > >
> > > > I see that Parquet already supports Bloom filters.
> > > >
> > > > As far as I understand, it currently uses them only on the entire
> > > > value.
> > > >
> > > > For example, if I have a column "MovieTitle":
> > > >
> > > > - "The title of my movie"
> > > > - "Another movie title"
> > > > - "The best movie title"
> > > > - ...
> > > >
> > > > Then the current Bloom filters can be used to find only the column
> > > > chunks/pages that match an exact title. For example, you can use the
> > > > bloom filter to search for "The best movie title".
> > > >
> > > > It would be interesting to have *a bloom filter on the specific
> > > > words*, instead of using the entire value: in this way you can search
> > > > the word "best" in the "MovieTitle" column and find the titles that
> > > > contain that specific word in an efficient way.
> > > >
> > > > It would enable a sort of full-text search of keywords inside text
> > > > columns. It would also allow predicate pushdown for searches based on
> > > > keywords.
> > > >
> > > > Would it make sense to have such an addition? Is there any strategy
> > > > already used by Parquet for fast keyword searches inside text columns?
> > > >
> > > >
> > > > Best regards,
> > > > Marco Colli
> > > > AbstractBrain srls
> > > >
> > >
> >
>
