Re: Writing empty strings to parquet files

scarrascoso Mon, 07 May 2018 03:23:11 -0700

Hi Wes:

Thanks for your message.


I would say that both test_pandas_parquet_1_0_rountrip and 
test_pandas_parquet_2_0_rountrip (in 
arrow/python/pyarrow/tests/test_parquet.py) already test this.
Sorry I didn’t realize this sooner.

All the best,

Sergio Carrascoso

> On 5 May 2018, at 01:31, Wes McKinney <[email protected]> wrote:
> 
> Thanks Sergio. If we don't have any unit tests explicitly testing
> this, it would be a good idea to add some anyway.
> 
> - Wes
> 
> On Fri, May 4, 2018 at 12:26 PM,  <[email protected]> wrote:
>> Hi Uwe:
>> 
>> Thanks a lot for your feedback.
>> 
>> While preparing a simple example to reproduce this issue, I have been able 
>> to get the expected behavior (empty strings properly written as ‘’ in the 
>> parquet file).
>> So there is actually no problem with Parquet.write_table.
>> 
>> The problem was instead a bug in my own process: two steps were in the 
>> wrong order, so unicode formatting was applied to None values earlier 
>> than expected, turning them into the string ‘None’.
>> 
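To illustrate the ordering mistake, here is a minimal sketch. The two helper functions are hypothetical stand-ins for the steps in my process, which are not shown in this thread; the only assumption is that one step formats values as unicode text and another handles missing values:

```python
def to_text(value):
    # Hypothetical formatting step: coerces any value to a unicode string.
    return "{}".format(value)


def fill_missing(value):
    # Hypothetical cleanup step: replaces missing values with empty strings.
    return "" if value is None else value


raw = ["abc", "", None]

# Wrong order: formatting runs first, so None becomes the string 'None'
# and the cleanup step no longer recognizes it as missing.
wrong = [fill_missing(to_text(v)) for v in raw]

# Intended order: missing values are handled before formatting.
right = [to_text(fill_missing(v)) for v in raw]
```

With the wrong order the last element ends up as the text 'None', which is then written to the parquet file like any other string.
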
>> Again, thank you very much and apologies for the noise.
>> 
>> Best,
>> 
>> Sergio Carrascoso
>> 
>>> On 4 May 2018, at 10:54, Uwe L. Korn <[email protected]> wrote:
>>> 
>>> Hello Sergio,
>>> 
>>> This is definitely unwanted behaviour. Can you open an issue on 
>>> https://issues.apache.org/jira/projects/PARQUET and provide a minimal 
>>> reproducing example? There is definitely a difference between empty strings 
>>> and null strings. Parquet also supports this distinction, so we should 
>>> support roundtripping them.
>>> 
>>> Uwe
>>> 
>>> On Thu, May 3, 2018, at 8:47 AM, [email protected] wrote:
>>>> 
>>>> Hi:
>>>> 
>>>> I would like to know if there is any way in PyArrow to write empty
>>>> string values to a parquet file.
>>>> When I use Parquet.write_table, if any column contains empty string
>>>> values, they end up as None in the parquet file.
>>>> My process depends on these values being properly written as empty
>>>> strings in the parquet files.
>>>> 
>>>> To provide some context, my current workflow is the following:
>>>> 
>>>> - Read content from json files (using Pandas.read_json)
>>>> - Convert the corresponding dataframe to a PyArrow table (using
>>>> PyArrow.Table.from_pandas)
>>>> - Finally, write the table to a parquet file (using Parquet.write_table)
>>>> 
>>>> I have done some checks during the process, and the empty string values
>>>> are preserved up to the step that writes the parquet file.
>>>> 
>>>> The options for the write_table method don't provide anything specific for
>>>> this. Is this behavior (writing '' as None) an unavoidable default?
>>>> Is there any other way to write the parquet files that gives me more
>>>> options to deal with this?
>>>> 
>>>> Any hint or feedback will be greatly appreciated.
>>>> 
>>>> Thanks a lot in advance, all the best.
>>>> 
>>>> Sergio Carrascoso
>>>> 
>>
