I think it'd be a good idea to also discuss metadata storage in Parquet. Right now we jam things into data pages (dictionary header) and file footers (statistics objects), neither of which is good for fast queries. If we hide metadata storage behind an interface, we could start doing things like storing stats + dictionaries + indexes in a database (or a SQLite file or whatever), and at job submit time we could query the database to do fast push-down filters and such.
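To make the submit-time lookup concrete, here is a minimal sketch, assuming a hypothetical SQLite table holding per-row-group min/max statistics. The table name, column names, and helper functions are illustrative only; none of them come from Parquet or any existing API.

```python
import sqlite3


def build_stats_db(rows):
    """Create an in-memory stats store (hypothetical schema, not a Parquet API)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE column_stats ("
        " file_path TEXT, row_group INTEGER, column_name TEXT,"
        " min_value INTEGER, max_value INTEGER)"
    )
    # Index on the lookup columns so the planner query does not scan every row.
    conn.execute(
        "CREATE INDEX idx_stats ON column_stats (column_name, min_value, max_value)"
    )
    conn.executemany("INSERT INTO column_stats VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def row_groups_for_equality(conn, column_name, value):
    """Return (file_path, row_group) pairs whose [min, max] range may contain value."""
    cur = conn.execute(
        "SELECT file_path, row_group FROM column_stats"
        " WHERE column_name = ? AND min_value <= ? AND max_value >= ?"
        " ORDER BY file_path, row_group",
        (column_name, value, value),
    )
    return cur.fetchall()


# Made-up statistics for three row groups across two files.
stats = [
    ("a.parquet", 0, "user_id", 1, 100),
    ("a.parquet", 1, "user_id", 101, 200),
    ("b.parquet", 0, "user_id", 150, 300),
]
conn = build_stats_db(stats)

# At job submit time: which row groups can possibly match user_id = 175?
candidates = row_groups_for_equality(conn, "user_id", 175)
print(candidates)
```

With the made-up stats above, only the second row group of a.parquet and the row group of b.parquet survive the filter, so a job would never need to open the first row group. The same interface could sit in front of the existing footer statistics, which is the point of hiding the storage behind an abstraction.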
Another discussion worth having is full inverted indexes (like how Lucene works) for search use cases. It might even be interesting to see if Parquet can interop w/ Lucene (there are actually a lot of similarities between Parquet and Lucene when it comes to data storage).

On Thu, Jul 9, 2015 at 9:56 AM, Ryan Blue <[email protected]> wrote:

> Lately, we've been discussing bloom filters on PARQUET-41. It looks like a
> good option for that is to put them into some form of index page that can
> be used or ignored.
>
> I'm really interested in hearing about the other ideas that have been
> thrown out on this.
>
> rb
>
> On 07/09/2015 09:22 AM, Jacques Nadeau wrote:
>
>> I think we should start a design discussion around this. I think there
>> were early ideas by some of the initial authors. However, I don't think
>> it has been designed.
>>
>> On Jul 9, 2015 9:16 AM, "Patrick Woody" <[email protected]> wrote:
>>
>>> Just wanted to follow up here. Is there any information on index pages
>>> available?
>>>
>>> On Thu, Jul 2, 2015 at 4:22 PM, Patrick Woody <[email protected]>
>>> wrote:
>>>
>>>> Hey all,
>>>>
>>>> I've seen various mentions about Parquet index pages in the docs and
>>>> various slides/talks. Is there any up to date resource on what the plan
>>>> for these are?
>>>>
>>>> Thanks!
>>>> -Pat
>
> --
> Ryan Blue
> Software Engineer
> Cloudera, Inc.

--
Alex Levenson
@THISWILLWORK
