Re: parquet using encoding other than UTF-8

2019-02-07 Thread Manik Singla
Thanks all, I decided to just convert to UTF-8.

Regards
Manik Singla
+91-9996008893
+91-9665639677

"Life doesn't consist in holding good cards but playing those you hold
well."


Re: parquet using encoding other than UTF-8

2019-02-06 Thread Ryan Blue
Ok, thanks. Sorry I misread what you meant!

-- 
Ryan Blue
Software Engineer
Netflix


Re: parquet using encoding other than UTF-8

2019-02-06 Thread Wes McKinney
> I think the right thing is to transcode your data to UTF-8.

I agree -- I wasn't recommending the approach I described, just to say
that it is not impossible.


Re: parquet using encoding other than UTF-8

2019-02-06 Thread Ryan Blue
I disagree with Wes. He's right that you *could* just use binary and keep
extra metadata somewhere, but it is very unlikely that Parquet would ever
support such a scheme. And it is bad for the community when people attempt
to go around the format spec, as we see with the INT96 timestamp mess.

I think the right thing is to transcode your data to UTF-8. That should not
be hard at all in any language, it will create standards-compliant Parquet
that can be used in any processing engine, and you'll save space when the
data is stored. UTF-16 to UTF-8 is not lossy, so it should be fine. The fact
that you don't control the input data's encoding doesn't mean you can't
write it to Parquet correctly.

rb
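
As an illustration of the transcoding step recommended above, here is a
minimal Java sketch -- the class and method names and the sample charset are
illustrative assumptions, not something from this thread. It decodes an
incoming ByteBuffer using its declared charset and re-encodes the text as
UTF-8 before the bytes are handed to a Parquet writer:

import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public final class TranscodeToUtf8 {

    // Decode the incoming bytes with their declared charset, then re-encode
    // the resulting text as UTF-8.
    static byte[] toUtf8(ByteBuffer input, Charset sourceCharset) {
        String decoded = sourceCharset.decode(input).toString();
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Simulate a UTF-16 payload arriving as a ByteBuffer.
        ByteBuffer utf16 = StandardCharsets.UTF_16.encode("héllo wörld");
        byte[] utf8 = toUtf8(utf16, StandardCharsets.UTF_16);
        // These UTF-8 bytes are what would be written to the string column.
        System.out.println(new String(utf8, StandardCharsets.UTF_8));
    }
}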


-- 
Ryan Blue
Software Engineer
Netflix


Re: parquet using encoding other than UTF-8

2019-02-06 Thread Wes McKinney
Hi Manik -- you are free to store the data as opaque binary using the
BYTE_ARRAY type and add some metadata to the schema so your readers can
recognize that it's UTF-16 stored as binary.
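
For reference, a minimal sketch of the schema side of that idea using
parquet-mr's schema classes. The column name and the metadata key are made-up
examples, and how the key-value metadata actually gets attached to the file
footer depends on the writer API in use:

import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class OpaqueBinarySchema {
    public static void main(String[] args) {
        // "binary" with no (UTF8) annotation means opaque bytes,
        // not a Parquet string.
        MessageType schema = MessageTypeParser.parseMessageType(
            "message msg { required binary payload; }");
        System.out.println(schema);

        // The hint that the bytes are UTF-16 would travel as application-level
        // key-value metadata in the file footer, e.g. "payload.charset" ->
        // "UTF-16LE"; readers that do not know the key simply see raw binary.
    }
}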


Re: parquet using encoding other than UTF-8

2019-02-05 Thread Manik Singla
I am not the producer of the data, so I cannot control the encoding. I
receive a ByteBuffer together with its encoding.
I can decode the data with the given encoding and convert it to UTF-8 for
storing with Parquet.
I was hoping to avoid that overhead if possible.

Regards
Manik Singla
+91-9996008893
+91-9665639677

"Life doesn't consist in holding good cards but playing those you hold
well."


Re: parquet using encoding other than UTF-8

2019-02-05 Thread Uwe L. Korn
Hello Manik,

This is not possible at the moment. As Parquet is a portable on-disk format, we 
focus on having a single representation for each data type, so readers and writers 
only have to implement those representations, which keeps them simple. Especially 
since you are the producer but not the consumer, even adding a new type would not 
solve your problem: you could only use a new logical type once it has been 
implemented in all the readers and all of your consumers have updated to those 
reader versions.

As Unicode, and thus UTF-8, supports every character one can think of, you should 
always be able to convert your strings to it. Given that Parquet files encode and 
compress the data anyway afterwards, the conversion adds a bit of CPU overhead but 
should not make a difference to the size or form of the data actually stored in 
the files. I would also guess that the UTF-16 to UTF-8 conversion costs less CPU 
than the Parquet compression process.

Did this help, or is there any reason why you really cannot convert your 
data to UTF-8?

Uwe
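
As a side note on the conversion itself: using a strict decoder makes bytes
that are not actually valid in the declared charset fail loudly instead of
being silently replaced. A small Java sketch, with illustrative names and
sample input that are not from this thread:

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public final class StrictTranscode {

    // Throws CharacterCodingException on malformed or unmappable input
    // instead of inserting replacement characters.
    static byte[] toUtf8Strict(ByteBuffer input, Charset sourceCharset)
            throws CharacterCodingException {
        CharsetDecoder decoder = sourceCharset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(input).toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws CharacterCodingException {
        ByteBuffer utf16 = StandardCharsets.UTF_16LE.encode("grüße");
        byte[] utf8 = toUtf8Strict(utf16, StandardCharsets.UTF_16LE);
        System.out.println(utf8.length + " UTF-8 bytes");
    }
}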


parquet using encoding other than UTF-8

2019-02-05 Thread Manik Singla
Hi

I am new to Parquet. I am trying to save strings in UTF-16 or some encoding
other than UTF-8.
I am also trying to attach an encoding hint when saving a ByteBuffer.

I can't find a way to use anything other than UTF-8.
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md says
we can extend the primitive types to cover such cases.

Another thing I want to mention is that I am only the producer of the
Parquet file, not the consumer.

Could you guide me to examples I can look at, or to the right way to do this?


Regards
Manik Singla
+91-9996008893
+91-9665639677

"Life doesn't consist in holding good cards but playing those you hold
well."