This past week I actually tried out indexing Parquet externally (storing split + row offset info) in Elasticsearch. This worked out OK, but it wasn't as fast as it could have been: a seek that doesn't materialize intermediate records would have been great. Additionally, it would be helpful to expose the row group by ID or something; relying on the split was problematic when filter pushdown was enabled.
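
To make the external-index idea concrete, here is a minimal sketch. It is not the setup described above: a plain dict stands in for Elasticsearch, and the names RowLocation, locate, and build_index are hypothetical. It assumes only that the row count of each row group is known (the Parquet footer records this), so a file-wide row number can be resolved to a row group ordinal plus an offset without reading any records.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import Dict, List, NamedTuple


class RowLocation(NamedTuple):
    """Hypothetical index entry: where one record lives."""
    path: str        # Parquet file containing the record
    row_group: int   # ordinal of the row group within that file
    row_offset: int  # row position inside that row group


def locate(path: str, row_group_sizes: List[int], global_row: int) -> RowLocation:
    """Resolve a file-wide row number using only per-row-group row counts."""
    if not 0 <= global_row < sum(row_group_sizes):
        raise IndexError(f"row {global_row} is outside {path}")
    # ends[i] is the exclusive end row of row group i.
    ends = list(accumulate(row_group_sizes))
    # First row group whose end lies strictly beyond the requested row.
    group = bisect_right(ends, global_row)
    start = ends[group] - row_group_sizes[group]
    return RowLocation(path, group, global_row - start)


def build_index(path: str, row_group_sizes: List[int], keys: List[str]) -> Dict[str, RowLocation]:
    """Map each key (listed in file row order) to its location.

    The dict is a stand-in for whatever external store holds the index.
    """
    return {key: locate(path, row_group_sizes, row) for row, key in enumerate(keys)}


if __name__ == "__main__":
    # Assumed layout: three row groups holding 3, 2, and 4 rows.
    index = build_index("part-0.parquet", [3, 2, 4], ["a", "b", "c", "d", "e"])
    print(index["d"])
```

Keying the entry on the row group ordinal rather than on an input split is the point of the sketch: the location stays meaningful regardless of how a job decides to split the file. Skipping records inside the chosen row group would still need reader support.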

On Mon, Jul 20, 2015 at 5:34 PM, Alex Levenson <[email protected]> wrote:

> I think it'd be a good idea to also discuss metadata storage in parquet as
> well.
> Right now we jam things into data pages (dictionary header) and file
> footers (statistics objects), neither of which are good for fast queries.
> If we hide metadata storage behind an interface, we could start doing
> things like storing stats + dictionaries + indexes in a database (or
> sqllite file or w/e), and at job submit-time we could query the database to
> do fast push-down filters and such.
>
> Another discussion worth having is full inverted-indexes (like how Lucene
> works) for search use cases. It might even be interesting to see if parquet
> can interop w/ Lucene (there's actually a lot of similarities between
> parquet and lucene when it comes to data storage).
>
> On Thu, Jul 9, 2015 at 9:56 AM, Ryan Blue <[email protected]> wrote:
>
> > Lately, we've been discussing bloom filters on PARQUET-41. It looks like a
> > good option for that is to put them into some form of index page that can
> > be used or ignored.
> >
> > I'm really interested in hearing about the other ideas that have been
> > thrown out on this.
> >
> > rb
> >
> > On 07/09/2015 09:22 AM, Jacques Nadeau wrote:
> >
> >> I think we should start a design discussion around this. I think there
> >> were early ideas by some of the initial authors. However, I don't think
> >> it has been designed.
> >> On Jul 9, 2015 9:16 AM, "Patrick Woody" <[email protected]> wrote:
> >>
> >>> Just wanted to follow up here. Is there any information on index pages
> >>> available?
> >>>
> >>> On Thu, Jul 2, 2015 at 4:22 PM, Patrick Woody <[email protected]> wrote:
> >>>
> >>>> Hey all,
> >>>>
> >>>> I've seen various mentions about Parquet index pages in the docs and
> >>>> various slides/talks. Is there any up to date resource on what the
> >>>> plan for these are?
> >>>>
> >>>> Thanks!
> >>>> -Pat
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Cloudera, Inc.
>
> --
> Alex Levenson
> @THISWILLWORK
