Re: parquet using encoding other than UTF-8
thanks all, I decided to convert to UTF-8 only.

Regards
Manik Singla
+91-9996008893
+91-9665639677

"Life doesn't consist in holding good cards but playing those you hold
well."

On Thu, Feb 7, 2019 at 5:41 AM Ryan Blue wrote:
> Ok, thanks. Sorry I misread what you meant!
Re: parquet using encoding other than UTF-8
Ok, thanks. Sorry I misread what you meant!

On Wed, Feb 6, 2019 at 3:02 PM Wes McKinney wrote:
> > I think the right thing is to transcode your data to UTF-8.
>
> I agree -- I wasn't recommending the approach I described, just to say
> that it is not impossible.

--
Ryan Blue
Software Engineer
Netflix
Re: parquet using encoding other than UTF-8
> I think the right thing is to transcode your data to UTF-8.

I agree -- I wasn't recommending the approach I described, just to say
that it is not impossible.

On Wed, Feb 6, 2019 at 12:10 PM Ryan Blue wrote:
> I disagree with Wes.
Re: parquet using encoding other than UTF-8
I disagree with Wes. He's right that you *could* just use binary and keep
extra metadata somewhere, but it is very unlikely that Parquet would ever
support such a scheme. And it is bad for the community when people attempt
to go around the format spec, as we see with the INT96 timestamp mess.

I think the right thing is to transcode your data to UTF-8. That should not
be at all hard in any language, will create standards-compliant Parquet
that can be used in any processing engine, and you'll save space when it is
stored. UTF-16 to UTF-8 is not lossy, so it should be fine. The fact that
you don't control the input data encoding doesn't mean you can't write it
to Parquet correctly.

rb

On Wed, Feb 6, 2019 at 6:39 AM Wes McKinney wrote:
> hi Manik -- you are free to store the data as opaque binary as
> BYTE_ARRAY type and add some metadata to the schema so your readers
> can recognize that it's UTF-16 stored as binary

--
Ryan Blue
Software Engineer
Netflix
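[Editor's note: the transcoding Ryan describes needs no Parquet API at all, only the JDK's charset support. A minimal sketch in plain Java -- `toUtf8` is a hypothetical helper name, not anything from parquet-mr:]

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Transcode {
    // Decode the incoming buffer with its declared charset, then
    // re-encode the characters as UTF-8 before handing the bytes
    // to the Parquet writer as a UTF8-annotated BYTE_ARRAY.
    static ByteBuffer toUtf8(ByteBuffer input, Charset sourceCharset) {
        CharBuffer decoded = sourceCharset.decode(input);
        return StandardCharsets.UTF_8.encode(decoded);
    }

    public static void main(String[] args) {
        ByteBuffer utf16 = StandardCharsets.UTF_16LE.encode("héllo");
        ByteBuffer utf8 = toUtf8(utf16, StandardCharsets.UTF_16LE);
        String roundTripped = StandardCharsets.UTF_8.decode(utf8).toString();
        System.out.println(roundTripped.equals("héllo")); // true: lossless
    }
}
```

Since UTF-16 and UTF-8 both encode the full Unicode repertoire, the decode/encode round trip loses nothing, which is Ryan's point.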
Re: parquet using encoding other than UTF-8
hi Manik -- you are free to store the data as opaque binary as
BYTE_ARRAY type and add some metadata to the schema so your readers
can recognize that it's UTF-16 stored as binary

On Wed, Feb 6, 2019 at 12:24 AM Manik Singla wrote:
> I am not the producer of data so I cannot control encoding. I do receive
> ByteBuffer and encoding.
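[Editor's note: Wes's workaround amounts to keeping the column as raw bytes and recording the charset out of band. A sketch of the idea in plain Java -- the `"encoding"` metadata key is a made-up private convention between producer and consumer, not part of the Parquet spec, and the call to actually attach key-value metadata depends on which writer API you use:]

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Map;

public class OpaqueBinary {
    public static void main(String[] args) {
        String original = "UTF-16 payload \u2713";

        // Producer side: keep the raw UTF-16 bytes for the BYTE_ARRAY
        // column, and note the charset in file-level key-value metadata
        // under a key name of our own choosing.
        byte[] byteArrayValue = original.getBytes(StandardCharsets.UTF_16LE);
        Map<String, String> fileMetadata = Map.of("encoding", "UTF-16LE");

        // Consumer side: read the BYTE_ARRAY back unchanged and decode it
        // using the charset recorded in the metadata.
        String charsetName = fileMetadata.get("encoding");
        String decoded = new String(byteArrayValue, Charset.forName(charsetName));
        System.out.println(decoded.equals(original)); // true
    }
}
```

The bytes round-trip exactly, but as Ryan notes below, no standard Parquet reader will understand the column as text, which is why this approach is discouraged.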
Re: parquet using encoding other than UTF-8
I am not the producer of data so I cannot control the encoding. I receive
a ByteBuffer together with its encoding.
I can decode the data with the given encoding and convert it to UTF-8 for
storing with Parquet.
I was hoping to avoid that overhead if possible.

Regards
Manik Singla

On Wed, Feb 6, 2019 at 11:37 AM Uwe L. Korn wrote:
> Hello Manik,
>
> this is not possible at the moment.
Re: parquet using encoding other than UTF-8
Hello Manik,

this is not possible at the moment. As Parquet is a portable on-disk
format, we focus on having a single representation for each data type,
which keeps reader and writer implementations simpler. Especially as you
are the producer but not the consumer, even adding a new type would not
solve your problem. You really can only use a new logical type once it has
been implemented in all the readers and your consumers have all updated to
those reader versions.

As Unicode, and thus UTF-8, supports all characters one can think of, you
should always be able to convert strings to it. Given that Parquet files
encode and compress the data anyway afterwards, the conversion is a bit of
a CPU overhead but should not make a difference in the size and form of
the data actually stored in the files. Also, I guess that the
UTF-16->UTF-8 conversion costs less CPU than the Parquet compression
process.

Did this help you, or is there any reason why you really cannot convert
your data to UTF-8?

Uwe

On Wed, Feb 6, 2019, at 6:19 AM, Manik Singla wrote:
> Hi
>
> I am new to Parquet. I am trying to save UTF-16 or some other encoding
> than UTF-8.
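[Editor's note: Uwe's size point can be checked with a few lines of plain JDK code -- for ASCII-heavy text, UTF-8 is half the size of UTF-16 even before Parquet's own encoding and compression run. The sample strings are just illustrative:]

```java
import java.nio.charset.StandardCharsets;

public class EncodedSizes {
    public static void main(String[] args) {
        String ascii = "plain ascii text"; // 1 byte per char in UTF-8
        String mixed = "naïve café";       // ï and é take 2 bytes in UTF-8

        System.out.println(ascii.getBytes(StandardCharsets.UTF_8).length);    // 16
        System.out.println(ascii.getBytes(StandardCharsets.UTF_16LE).length); // 32
        System.out.println(mixed.getBytes(StandardCharsets.UTF_8).length);    // 12
        System.out.println(mixed.getBytes(StandardCharsets.UTF_16LE).length); // 20
    }
}
```

Only for text dominated by characters above U+07FF (e.g. CJK) does UTF-16 come out smaller, and even then dictionary encoding and compression usually close the gap.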
parquet using encoding other than UTF-8
Hi

I am new to Parquet. I am trying to save UTF-16 or some other encoding
than UTF-8.
I am also trying to use an encoding hint when saving a ByteBuffer.

I don't find a way to use anything other than UTF-8.
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md says
we can extend primitive types to cover such cases.

The other thing I want to mention is that I am only the producer of the
parquet file, not the consumer.

Could you point me to examples I can look into, or to the right way to do
this?

Regards
Manik Singla