This past week I tried indexing Parquet externally (storing split + row
offset info) in Elasticsearch. This worked OK, but it wasn't as fast as it
could have been: a seek that doesn't materialize intermediate records
would have been great. Additionally, it would be helpful to expose the row
group by ID or something similar, since relying on the split was
problematic when filter pushdown was enabled.
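
To make the row-group idea concrete, here is a rough sketch of what I mean
by indexing on (row group, offset in group) instead of split + row offset.
It is not what I actually ran: the names RowLocation, locate and
build_index are made up for illustration, a plain dict stands in for
Elasticsearch, and the per-row-group row counts are assumed to have been
read from the file footer beforehand.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import Dict, List, NamedTuple


class RowLocation(NamedTuple):
    """Hypothetical index entry; not part of any Parquet API."""
    path: str          # Parquet file containing the row
    row_group: int     # row group ordinal within that file
    row_in_group: int  # offset of the row inside that row group


def locate(path: str, row_group_sizes: List[int], row_offset: int) -> RowLocation:
    """Translate a file-level row offset into (row group, offset in group).

    row_group_sizes is assumed to hold the row count of each row group,
    in file order, as recorded in the footer metadata.
    """
    ends = list(accumulate(row_group_sizes))  # cumulative row counts
    total = ends[-1] if ends else 0
    if not 0 <= row_offset < total:
        raise IndexError(f"row {row_offset} out of range for {path}")
    # First row group whose cumulative end is strictly greater than the offset.
    group = bisect_right(ends, row_offset)
    start = ends[group - 1] if group else 0
    return RowLocation(path, group, row_offset - start)


def build_index(path: str, row_group_sizes: List[int],
                keys: List[str]) -> Dict[str, RowLocation]:
    """Map each key (one per row, in file order) to its location.

    The returned dict is a stand-in for the external index documents.
    """
    return {key: locate(path, row_group_sizes, i) for i, key in enumerate(keys)}


if __name__ == "__main__":
    index = build_index("part-0.parquet", [3, 2, 4], list("abcdefghi"))
    print(index["d"])  # first row of the second row group
```

With entries like these a reader would only need to open the one row group
and skip within it, which is the seek I was wishing for above. Whether the
skip itself can avoid materializing records depends on reader support that,
as far as I can tell, does not exist yet.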

On Mon, Jul 20, 2015 at 5:34 PM, Alex Levenson <
[email protected]> wrote:

> I think it'd be a good idea to also discuss metadata storage in parquet as
> well.
> Right now we jam things into data pages (dictionary header) and file
> footers (statistics objects), neither of which are good for fast queries.
> If we hide metadata storage behind an interface, we could start doing
> things like storing stats + dictionaries + indexes in a database (or
> SQLite file or w/e), and at job submit-time we could query the database to
> do fast push-down filters and such.
>
> Another discussion worth having is full inverted-indexes (like how Lucene
> works) for search use cases. It might even be interesting to see if parquet
> can interop w/ Lucene (there's actually a lot of similarities between
> parquet and lucene when it comes to data storage).
>
> On Thu, Jul 9, 2015 at 9:56 AM, Ryan Blue <[email protected]> wrote:
>
> > Lately, we've been discussing bloom filters on PARQUET-41. It looks like a
> > good option for that is to put them into some form of index page that can
> > be used or ignored.
> >
> > I'm really interested in hearing about the other ideas that have been
> > thrown out on this.
> >
> > rb
> >
> > On 07/09/2015 09:22 AM, Jacques Nadeau wrote:
> >
> >> I think we should start a design discussion around this.  I think there
> >> were early ideas by some of the initial authors.  However, I don't think it
> >> has been designed.
> >> On Jul 9, 2015 9:16 AM, "Patrick Woody" <[email protected]> wrote:
> >>
> >>> Just wanted to follow up here. Is there any information on index pages
> >>> available?
> >>>
> >>> On Thu, Jul 2, 2015 at 4:22 PM, Patrick Woody <[email protected]>
> >>> wrote:
> >>>
> >>>> Hey all,
> >>>>
> >>>> I've seen various mentions of Parquet index pages in the docs and
> >>>> various slides/talks. Is there any up-to-date resource on what the plan
> >>>> for these is?
> >>>>
> >>>> Thanks!
> >>>> -Pat
> >>>>
> >>>>
> >>>
> >>
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Cloudera, Inc.
> >
>
>
>
> --
> Alex Levenson
> @THISWILLWORK
>
