[
https://issues.apache.org/jira/browse/ARROW-3907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16720638#comment-16720638
]
David Lee commented on ARROW-3907:
----------------------------------
Yeah, I'm trying to figure out the best way to preserve INTs when converting
JSON to Parquet.
The problem is more or less summarized here.
[https://pandas.pydata.org/pandas-docs/stable/gotchas.html]
There are a lot of gotchas with each step.
json.loads() works fine.
pandas.DataFrame() is a problem if the records don't all contain the same
columns.
Using pandas.DataFrame.reindex() to add the missing columns fills them with
NaN values.
Adding NaN values forces a column's dtype to change from INT64 to FLOAT64.
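For example, a minimal sketch of those first steps (the record shape here is
made up for illustration):

{code:python}
import json
import pandas as pd

# Two JSON records that don't share the same columns.
records = json.loads('[{"id": "a", "count": 1}, {"id": "b"}]')

df = pd.DataFrame(records)
# The missing "count" for record "b" becomes NaN, which forces the
# whole column from int64 to float64.
print(df["count"].dtype)  # float64

# reindex() to add a column missing from every record behaves the same:
df = df.reindex(columns=["id", "count", "extra"])
print(df["extra"].dtype)  # float64, filled with NaN
{code}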
NaNs are a problem to begin with, because if you convert them to Parquet you
end up with zeros instead of nulls.
Running pandas.DataFrame.reindex(fill_value=None) doesn't help, because
passing in None is equivalent to calling pandas.DataFrame.reindex() without
any params.
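A quick sketch of that behavior:

{code:python}
import pandas as pd

df = pd.DataFrame({"count": [1, 2]})

# fill_value=None behaves like the default, so the new row is NaN,
# not None, and the column still flips to float64.
out = df.reindex(index=[0, 1, 2], fill_value=None)
print(out["count"].tolist())  # [1.0, 2.0, nan]
{code}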
The only way to replace NaNs with None is with pandas.DataFrame.where().
After replacing the NaNs you can change the column's dtype from FLOAT64 back
to INT64.
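A sketch of the whole dance; the int() round-trip stands in for the dtype
change, since the object column needs Python ints for Arrow to read it as
INT64:

{code:python}
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"count": [1.0, None]})  # NaN forced the column to float64

# where() keeps values where the condition holds and substitutes None
# elsewhere, turning the column into dtype object.
clean = df.where(df.notnull(), None)

# Restore the surviving values to Python ints (the "back to INT64" step).
clean["count"] = clean["count"].map(lambda v: None if v is None else int(v))

# Arrow now converts the object column to int64 with a real null.
schema = pa.schema([pa.field("count", pa.int64())])
table = pa.Table.from_pandas(clean, schema=schema, preserve_index=False)
{code}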
It's basically a lot of hoops to jump through just to preserve your original
JSON INT as a Parquet INT.
Maybe the best solution is to create a pyarrow.Table.from_pydict() function
that builds an Arrow table directly from a Python dictionary. Right now we
have pyarrow.Table.to_pydict(), pyarrow.Table.to_pandas() and
pyarrow.Table.from_pandas(), but no way back from a plain dictionary.
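Something like the sketch below could sit behind it. pa.array() and
pa.Table.from_arrays() already exist; the from_pydict() name itself is the
proposal:

{code:python}
import pyarrow as pa

data = {"id": ["a", "b"], "count": [1, None]}

# What a from_pydict() could do internally: build one array per column,
# never routing through pandas, so ints stay ints and None stays null.
arrays = [pa.array(data["id"], type=pa.string()),
          pa.array(data["count"], type=pa.int64())]
table = pa.Table.from_arrays(arrays, names=["id", "count"])
{code}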
> [Python] from_pandas errors when schemas are used with lower resolution
> timestamps
> ----------------------------------------------------------------------------------
>
> Key: ARROW-3907
> URL: https://issues.apache.org/jira/browse/ARROW-3907
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.11.1
> Reporter: David Lee
> Priority: Major
> Fix For: 0.11.1
>
>
> When passing in a schema object to from_pandas a resolution error occurs if
> the schema uses a lower resolution timestamp. Do we need to also add
> "coerce_timestamps" and "allow_truncated_timestamps" parameters found in
> write_table() to from_pandas()?
> Error:
> pyarrow.lib.ArrowInvalid: ('Casting from timestamp[ns] to timestamp[ms] would
> lose data: 1532015191753713000', 'Conversion failed for column modified with
> type datetime64[ns]')
> Code:
>
> {code:python}
> processed_schema = pa.schema([
>     pa.field('Id', pa.string()),
>     pa.field('modified', pa.timestamp('ms')),
>     pa.field('records', pa.int32())
> ])
>
> pa.Table.from_pandas(df, schema=processed_schema, preserve_index=False)
> {code}
>
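For reference, write_table() already exposes the two parameters mentioned in
the description, so the truncation can be requested explicitly at write time.
A minimal sketch (the sample DataFrame is made up for illustration):

{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"modified": pd.to_datetime(["2018-07-19 15:06:31.753713"])})

# Convert with the nanosecond timestamps intact, then coerce when writing.
table = pa.Table.from_pandas(df, preserve_index=False)
pq.write_table(table, "out.parquet",
               coerce_timestamps="ms",
               allow_truncated_timestamps=True)
{code}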
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)