tjwilson90 opened a new issue, #5108:
URL: https://github.com/apache/arrow-rs/issues/5108

   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   
   I want to generate Parquet files that efficiently support prefix, 
substring, and suffix queries on strings. E.g., given a column of strings, find 
all strings containing a given query as a substring as quickly as possible.
   
   Currently, as far as I know, this can't be done in the general case without 
loading all the data and doing an exhaustive scan.
   
   **Describe the solution you'd like**
   
   What I'd like to be able to do is build a bloom filter over the ngrams of 
my data. E.g., if my column contained the string "Hello, World!", and I was 
building a bloom filter with ngrams of length 4, I'd want to insert the 
sequences ["Hell", "ello", "llo,", ..., "rld!"] into the bloom filter. Later, 
when querying for a substring, I'd check whether every ngram of the query, at 
that same length, is in the bloom filter. E.g., to search for "World", I would 
check whether the bloom filter contained "Worl" and "orld". If either is 
missing, the data cannot contain the query.
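
To make the ngram scheme concrete, here is a minimal sketch of the insert and query sides. It is only an illustration: `ngrams` and `NgramIndex` are hypothetical names, and an exact `HashSet` stands in for the bloom filter, so unlike a real filter it never reports a false positive for an individual ngram.

```rust
use std::collections::HashSet;

/// Returns every contiguous ngram of `n` characters in `s`.
/// Splits on chars rather than bytes so multi-byte UTF-8 is never cut mid-character.
fn ngrams(s: &str, n: usize) -> Vec<String> {
    let chars: Vec<char> = s.chars().collect();
    if n == 0 || chars.len() < n {
        return Vec::new();
    }
    chars
        .windows(n)
        .map(|w| w.iter().collect::<String>())
        .collect()
}

/// Hypothetical stand-in for a bloom filter over ngrams (assumption: exact set).
struct NgramIndex {
    n: usize,
    grams: HashSet<String>,
}

impl NgramIndex {
    fn new(n: usize) -> Self {
        NgramIndex { n, grams: HashSet::new() }
    }

    /// Indexes all ngrams of one column value.
    fn insert(&mut self, value: &str) {
        for gram in ngrams(value, self.n) {
            self.grams.insert(gram);
        }
    }

    /// Returns `false` only if the indexed data certainly lacks `query` as a substring.
    /// Queries shorter than `n` yield no ngrams and therefore cannot be ruled out.
    fn might_contain(&self, query: &str) -> bool {
        ngrams(query, self.n).iter().all(|g| self.grams.contains(g))
    }
}
```

After indexing "Hello, World!" with `n = 4`, `might_contain("World")` is true because both "Worl" and "orld" are present, while `might_contain("Word!")` is false because "Word" was never inserted. A `true` result still requires scanning the data, since the matching ngrams may come from different strings or different positions.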
   
   I don't really want this exact feature to be implemented in this library 
because I actually want a bit of flexibility to change things (e.g., maybe I 
want to index additional ngrams containing sentinels for start of word and end 
of word, maybe I want to delay the sizing of the bloom filter until I observe 
how many distinct ngrams I have) and I don't necessarily think my solution 
described above is appropriate as a general solution for everyone. Instead, I 
want to have the ability to customize how bloom filters are built and then 
implement this logic myself.
   
   I'm not sure exactly how to best implement this, but I imagine something 
like having a trait
   ```
   pub trait SbbfBuilder {
       fn insert<T: AsBytes + ?Sized>(&mut self, value: &T);
       fn build(self) -> Sbbf;
   }
   ```
   With that trait, a `Box<dyn SbbfBuilder>` could be set on 
`ColumnProperties`, and the `bloom_filter: Option<Sbbf>` in 
`ColumnValueEncoderImpl` and `ByteArrayEncoder` could instead become an 
`Option<Box<dyn SbbfBuilder>>`.
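
One caveat with the trait as written: a generic method such as `insert<T: AsBytes + ?Sized>` prevents the trait from being used as `dyn SbbfBuilder`, and a by-value `build(self)` cannot be called through a `Box<dyn ...>`. The sketch below shows one possible object-safe variant, not a proposed final API. `SbbfStub` is a hypothetical placeholder for the parquet crate's `Sbbf` so the example compiles on its own, and `NgramBuilder` is likewise an assumed name.

```rust
use std::collections::BTreeSet;

/// Hypothetical placeholder for the parquet crate's `Sbbf`; stores entries exactly.
struct SbbfStub {
    entries: BTreeSet<Vec<u8>>,
}

/// Object-safe variant of the proposed trait: `insert` takes raw bytes instead of
/// a generic `T`, and `build` consumes the box so it works on `Box<dyn SbbfBuilder>`.
trait SbbfBuilder {
    fn insert(&mut self, value: &[u8]);
    fn build(self: Box<Self>) -> SbbfStub;
}

/// Example builder that indexes byte ngrams rather than whole values.
struct NgramBuilder {
    n: usize,
    grams: BTreeSet<Vec<u8>>,
}

impl SbbfBuilder for NgramBuilder {
    fn insert(&mut self, value: &[u8]) {
        // `windows(0)` panics, so treat n == 0 as "index nothing".
        if self.n == 0 {
            return;
        }
        for window in value.windows(self.n) {
            self.grams.insert(window.to_vec());
        }
    }

    fn build(self: Box<Self>) -> SbbfStub {
        let this = *self;
        // A real builder could size the filter here from `this.grams.len()`,
        // i.e. after observing how many distinct ngrams exist.
        SbbfStub { entries: this.grams }
    }
}
```

An encoder holding `Option<Box<dyn SbbfBuilder>>` would call `insert` with each value's bytes and `build` once when the column chunk is finished. Deferring the sizing decision to `build` is what allows the filter size to depend on the observed number of distinct ngrams.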

