Hi Dmitriy,

Currently the implementation supports this behavior in both the reader and
the writer, so it is technically designed to handle it. However, I was under
the impression that part of the motivation for the page abstraction was to
define a maximum amount of data that needs to be decompressed for a point
query into a parquet file (assuming you had some kind of index telling you
which pages to look in to find a particular range of values for each
column).

We are working on adding support for repeated columns that exhibit this
behavior, as it has been part of parquet since the initial release; however,
it seemed like it might have been a design oversight. I don't think there
needs to be any additional overhead to support the change. If we changed the
model a little to process repetition and definition levels up to the point
where a record completes, and then copied all of the data in a tight loop at
that point, the performance should be nearly identical to the current
implementation, if not a little better, because the copies happen closer
together, which may help the CPU memory cache. If that is the case, it might
even be worth saving up the lists from several successive records, as the
page data buffer would be able to stay in the cache longer.
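
To make that concrete, here is a rough sketch of the kind of buffering I
have in mind. This is not the actual parquet-mr code; the PageSink interface,
its methods, and the RecordBufferingWriter class are hypothetical stand-ins
for whatever ColumnWriterImpl uses internally, and I am only showing long
values to keep it short:

import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the page-level writer; not a real parquet-mr API.
interface PageSink {
  void write(int repetitionLevel, int definitionLevel, long value);
  long bufferedSize();      // bytes accumulated in the current page
  void finishPage();        // seal the current page and start a new one
}

// Buffers one record at a time and only copies it into the page once the
// record is known to be complete, so a record can never straddle two pages.
class RecordBufferingWriter {
  private final PageSink sink;
  private final long pageSizeThreshold;
  private final List<int[]> levels = new ArrayList<>(); // {rep, def} pairs
  private final List<Long> values = new ArrayList<>();

  RecordBufferingWriter(PageSink sink, long pageSizeThreshold) {
    this.sink = sink;
    this.pageSizeThreshold = pageSizeThreshold;
  }

  void write(int repetitionLevel, int definitionLevel, long value) {
    // Repetition level 0 marks the start of a new record, which is the only
    // point at which we know the previous record is complete.
    if (repetitionLevel == 0 && !values.isEmpty()) {
      flushBufferedRecord();
      // Decide on the page boundary between records, never inside one.
      if (sink.bufferedSize() >= pageSizeThreshold) {
        sink.finishPage();
      }
    }
    levels.add(new int[] {repetitionLevel, definitionLevel});
    values.add(value);
  }

  // Copy the whole buffered record into the page in one tight loop.
  private void flushBufferedRecord() {
    for (int i = 0; i < values.size(); i++) {
      sink.write(levels.get(i)[0], levels.get(i)[1], values.get(i));
    }
    levels.clear();
    values.clear();
  }

  void close() {
    flushBufferedRecord();
    sink.finishPage();
  }
}

Checking the page size only between records is what keeps a list from being
split, and flushBufferedRecord could just as easily hold several complete
records before copying, which is the cache point above.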

-Jason



On Mon, Jun 30, 2014 at 12:17 PM, Dmitriy Ryaboy <[email protected]> wrote:

> Not sure I have all the details straight, but it seems like this caching
> can be problematic for very large lists. Is there a way to design this so
> it can span pages? Or does this not make sense since a single record has to
> fit in a page?
>
>
>
>
> On Mon, Jun 30, 2014 at 9:10 AM, Jason Altekruse <[email protected]> wrote:
>
> > Hello Parquet devs,
> >
> > I have been working more on the Drill implementation of parquet to bring
> > us up to full read compatibility as well as implement write support. We
> > are using the RecordConsumer interface to write data into parquet files,
> > and it seems that we have hit a bug when writing repeated data.
> >
> > I am currently just doing a simple test with a repeated field at the root
> > of the schema. I am writing data pulled in from a json file, where each
> > record contains one repeated Long column with seven items. The problem
> > appears when we hit one of the page thresholds: the ColumnWriterImpl
> > writes only the values from one of the lists that fit in the current
> > page, not the entire list, so a single 'value' within that column ends up
> > split across two pages. I took a look at the source and it does not look
> > like the ColumnWriterImpl actually ensures that a list ends before
> > cutting off the page. With the way repetition levels are encoded, I
> > believe the end of a list can only be detected by reading one value from
> > the next list (when the repetition level returns to 0). It seems like the
> > writes to the page data should be cached inside the writer until it is
> > determined that the entire list of values will fit in the page.
> >
> > Is there something that I am missing? I did not find any open issues for
> > this, but I will work on a patch to see if I can get it working for me.
> >
> > Jason Altekruse
> > MapR - Software Engineer
> > Apache Drill Team
> >
>