Hi Wes:

Thanks for your message.
I would say that both test_pandas_parquet_1_0_rountrip and test_pandas_parquet_2_0_rountrip (in arrow/python/pyarrow/tests/test_parquet.py) already test this.

Sorry I didn’t realize this sooner.

All the best,

Sergio Carrascoso

> On 5 May 2018, at 01:31, Wes McKinney <wesmck...@gmail.com> wrote:
>
> Thanks Sergio. If we don't have any unit tests explicitly testing
> this, it would be a good idea to add some anyway.
>
> - Wes
>
> On Fri, May 4, 2018 at 12:26 PM, <scarrasc...@ravenpack.com> wrote:
>> Hi Uwe:
>>
>> Thanks a lot for your feedback.
>>
>> While preparing a simple example to reproduce this issue, I have been able
>> to get the expected behavior (empty strings properly written as ‘’ in the
>> parquet file).
>> So actually there’s no problem with Parquet.write_table.
>>
>> The problem was rather a bug in which two steps in my process were in the
>> wrong order, so unicode formatting was being applied to None values earlier
>> than expected, thus turning them into ‘None’.
>>
>> Again, thank you very much and apologies for the noise.
>>
>> Best,
>>
>> Sergio Carrascoso
>>
>>> On 4 May 2018, at 10:54, Uwe L. Korn <uw...@xhochy.com> wrote:
>>>
>>> Hello Sergio,
>>>
>>> this is definitely unwanted behaviour. Can you open an issue on
>>> https://issues.apache.org/jira/projects/PARQUET and provide a minimal
>>> reproducing example? There is definitely a difference between empty strings
>>> and null strings. Parquet also supports the differentiation, thus we should
>>> support roundtripping them.
>>>
>>> Uwe
>>>
>>> On Thu, May 3, 2018, at 8:47 AM, scarrasc...@ravenpack.com wrote:
>>>>
>>>> Hi:
>>>>
>>>> I would like to know if there is any way in PyArrow to write empty
>>>> string values to a parquet file.
>>>> When I use Parquet.write_table, if any column contains empty string
>>>> values, they end up as None in the parquet file.
>>>> My process depends on these values being properly written as empty
>>>> strings in the parquet files.
>>>>
>>>> To provide some context, my current workflow is the following:
>>>>
>>>> - Read content from json files (using Pandas.read_json)
>>>> - Convert the corresponding dataframe to a PyArrow table (using
>>>> PyArrow.Table.from_pandas)
>>>> - Finally, write the table to a parquet file (using Parquet.write_table)
>>>>
>>>> I have done some checks during the process, and the empty string values
>>>> are being honored until the step that writes to a parquet file.
>>>>
>>>> The options for the write_table method don't provide anything specific for
>>>> this. Is this behavior (write '' as None) an unavoidable default?
>>>> Is there any other way to write the parquet files where I have more
>>>> options to deal with this?
>>>>
>>>> Any hint or feedback will be greatly appreciated.
>>>>
>>>> Thanks a lot in advance, all the best.
>>>>
>>>> Sergio Carrascoso
>>>>
>>
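
For anyone finding this thread later, the ordering bug described above (string formatting applied to None before the null handling step) can be sketched in plain Python. This is only an illustration under assumptions: the function names and sample values are hypothetical, not the code from the actual process, and it uses only the standard library rather than pandas or PyArrow.

```python
# Hypothetical sketch of the step-ordering bug; none of these names
# come from the original process.

def format_then_nullcheck(values):
    # Wrong order: formatting runs first, so None is turned into the
    # four-character string 'None' before the null check sees it.
    formatted = [u"{}".format(v) for v in values]
    return [None if v is None else v for v in formatted]


def nullcheck_then_format(values):
    # Intended order: leave None alone and format only real values,
    # so '' stays an empty string and None stays a null.
    return [None if v is None else u"{}".format(v) for v in values]


values = ["abc", "", None]
buggy = format_then_nullcheck(values)
fixed = nullcheck_then_format(values)

print(buggy)  # ['abc', '', 'None']
print(fixed)  # ['abc', '', None]
```

In both orderings the empty string survives untouched; only the None value is affected, which matches the conclusion that the writer itself was not at fault.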