You probably need to be more specific about which language bindings you are using. I think the C++ community is only just starting to work on writing out bloom filters (so writing them isn't supported yet in C++ or in the bindings that build on it, such as Python, R, and Ruby).
The way I read the specification, yes, each single value should be added to the bloom filter independently (though this may be a gray area where repeated fields were not considered).

On Wed, Jun 7, 2023 at 12:38 PM Marco Colli <[email protected]> wrote:

> @Micah Does that mean that columns of type array already get a bloom filter
> on each single value?
> I am using Apache Arrow in particular to deal with Parquet files
>
> On Wed, Jun 7, 2023, 16:00 Micah Kornfield <[email protected]> wrote:
>
> > Hi Marco,
> > Could you describe how your proposal differs from tokenizing the target
> > string and storing the list of tokens in a column that has a bloom filter
> > attached? I think this should be supportable today by the format at least
> > if not existing libraries.
> >
> > Thanks,
> > Micah
> >
> > On Wednesday, June 7, 2023, Gang Wu <[email protected]> wrote:
> >
> > > Hi Marco,
> > >
> > > That sounds interesting!
> > >
> > > However, this requires the parquet implementation to be able to tokenize
> > > both strings to write and literals in the filters. The actual efficiency
> > > depends on the data distribution. I am also concerned with the possible
> > > explosion of distinct values introduced by splitting words, which may
> > > result in a large bloom filter.
> > >
> > > Have you tried any PoC to get a rough estimate of benefits in your use
> > > case?
> > >
> > > Best,
> > > Gang
> > >
> > > On Tue, Jun 6, 2023 at 5:06 PM Marco Colli <[email protected]> wrote:
> > >
> > > > Hello,
> > > >
> > > > I see that Parquet already supports Bloom filters.
> > > >
> > > > For my understanding, it currently uses them only on the entire value.
> > > >
> > > > For example, if I have a column "MovieTitle":
> > > >
> > > > - "The title of my movie"
> > > > - "Another movie title"
> > > > - "The best movie title"
> > > > - ...
> > > >
> > > > Then the current Bloom filters can be used to find only the column
> > > > chunks/pages that match an exact title. For example you can use the
> > > > bloom filter to search for "The best movie title".
> > > >
> > > > It would be interesting to have *a bloom filter on the specific words*,
> > > > instead of using the entire value: in this way you can search the word
> > > > "best" in the "MovieTitle" column and find the titles that contain that
> > > > specific word in an efficient way.
> > > >
> > > > It would enable a sort of full-text search of keywords inside text
> > > > columns. It would also allow predicate pushdown for searches based on
> > > > keywords.
> > > >
> > > > Would it make sense to have such an addition? Is there any strategy
> > > > already used by Parquet for fast keyword searches inside text columns?
> > > >
> > > >
> > > > Best regards,
> > > > Marco Colli
> > > > AbstractBrain srls
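
To make the tokenization idea in this thread a bit more concrete, here is a rough sketch in plain Python. It is only an illustration under several assumptions: `ToyBloomFilter`, `tokenize`, and `build_filter` are hypothetical names, the filter is a textbook Bloom filter rather than the split-block layout the Parquet format defines, and the two "row groups" are just Python lists standing in for column chunks. No Parquet or Arrow API is used.

```python
import hashlib
import re


class ToyBloomFilter:
    """Textbook Bloom filter (an assumption; NOT Parquet's split-block layout)."""

    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit set

    def _positions(self, value):
        # Double hashing: derive all bit positions from one SHA-256 digest.
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


def tokenize(text):
    # Hypothetical tokenizer: lowercase, then split on non-alphanumerics.
    return [tok for tok in re.split(r"[^0-9a-z]+", text.lower()) if tok]


def build_filter(titles):
    # Add every token of every title to the filter, one value at a time.
    bloom = ToyBloomFilter()
    for title in titles:
        for token in tokenize(title):
            bloom.add(token)
    return bloom


# Two pretend row groups of the "MovieTitle" column.
row_groups = [
    ["The title of my movie", "Another movie title"],
    ["The best movie title"],
]
filters = [build_filter(titles) for titles in row_groups]

# The search literal has to be tokenized the same way as the stored data.
keyword = tokenize("Best")[0]
candidates = [i for i, f in enumerate(filters) if f.might_contain(keyword)]
print(candidates)
```

A row group that the filter rules out can be skipped entirely. One that it keeps still has to be scanned, since Bloom filters can return false positives. This also shows Gang's two concerns: the reader must tokenize the literal exactly as the writer tokenized the data, and the number of distinct tokens (not distinct titles) drives the filter size.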
