Thanks for your response, Jason. That helps too.

Regards,
~Pratik

On Thu, Sep 18, 2014 at 10:19 AM, Jason Altekruse <[email protected]>
wrote:

> Disregard my last message; Gmail wasn't showing me the new messages, so I
> hadn't seen Julien's response.
>
> On Thu, Sep 18, 2014 at 10:15 AM, Jason Altekruse <
> [email protected]>
> wrote:
>
> > Hi Pratik,
> >
> > You are correct that the overhead makes reading small files less
> > practical, but I wouldn't say the overhead is any worse than usual;
> > Parquet is optimized for reading a lot of data. If the data had been
> > compressed, the Parquet reader would have been running a block
> > decompression algorithm over very small byte ranges in your example, a
> > very inefficient use of such an algorithm. If your application is
> > producing datasets this small that it wants to persist to disk, you
> > should not be using Parquet directly; files like these should be
> > aggregated by some kind of ETL process into one large Parquet file for
> > long-term storage and fast analytical processing.
> >
> > -Jason
> >
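[Editorial note: a minimal sketch of the kind of compaction step described
above, assuming pyarrow and hypothetical file paths; neither appears in the
thread. Many small CSV files with the same schema are appended through a
single ParquetWriter so the result is one large Parquet file.]

import glob

import pyarrow.csv as pv
import pyarrow.parquet as pq

# Hypothetical layout: many small CSV files produced by the application.
small_files = sorted(glob.glob("incoming/*.csv"))

writer = None
for path in small_files:
    table = pv.read_csv(path)
    if writer is None:
        # Open one ParquetWriter so every batch lands in the same large
        # file instead of producing one tiny Parquet file per input.
        # This assumes all inputs share the same schema.
        writer = pq.ParquetWriter("aggregated.parquet", table.schema)
    writer.write_table(table)

if writer is not None:
    writer.close()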
> > On Wed, Sep 17, 2014 at 6:36 PM, pratik khadloya <[email protected]>
> > wrote:
> >
> >> Hello,
> >>
> >> Does anyone know if the Parquet format is generally not well suited,
> >> or slow, for reading and writing VARCHAR fields? I am currently
> >> investigating why it takes longer to read a Parquet file which has 5
> >> columns, BIGINT(20), BIGINT(20), SMALLINT(6), SMALLINT(6),
> >> VARCHAR(255), than to read a simple CSV file.
> >>
> >> Reading ALL the columns takes about 2 ms for the CSV file vs 650 ms
> >> for a Parquet file with the same data. There are only 700 rows in the
> >> table.
> >>
> >> Does anyone have any information about this?
> >> I suspect the overhead of the Parquet format is higher for smaller
> >> files.
> >>
> >> Thanks,
> >> Pratik
> >>
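[Editorial note: a minimal sketch of the timing comparison described above,
assuming pyarrow and hypothetical file names not taken from this thread; it
reads the same small table from CSV and from Parquet and prints the elapsed
time for each.]

import time

import pyarrow.csv as pv
import pyarrow.parquet as pq

def timed_read(label, read_fn):
    # Time a single cold read; for a table of only a few hundred rows,
    # per-file overhead dominates per-row decoding.
    start = time.perf_counter()
    table = read_fn()
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"{label}: {table.num_rows} rows in {elapsed_ms:.1f} ms")

# Hypothetical files holding the same 5-column, 700-row table.
timed_read("csv", lambda: pv.read_csv("sample.csv"))
timed_read("parquet", lambda: pq.read_table("sample.parquet"))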
> >
> >
>