[
https://issues.apache.org/jira/browse/ARROW-3907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16720638#comment-16720638
]
David Lee commented on ARROW-3907:
----------------------------------
Yeah, I'm trying to figure out the best way to preserve INTs when converting
JSON to Parquet.
The problem is more or less summarized here.
[https://pandas.pydata.org/pandas-docs/stable/gotchas.html]
There are a lot of gotchas with each step.
json.loads() works fine.
pandas.DataFrame() is a problem if the records don't all contain the same
columns.
Using pandas.DataFrame.reindex() to add the missing columns fills them with
NaN values.
Adding NaN values forces a column's dtype to change from INT64 to FLOAT64.
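For example, a minimal sketch of those first steps (the record shape here is
made up for illustration):

{code:python}
import json
import pandas as pd

# Two JSON records that don't share the same columns.
records = json.loads('[{"id": "a", "count": 1}, {"id": "b"}]')

df = pd.DataFrame(records)
# The missing "count" for record "b" becomes NaN, which forces the
# whole column from int64 to float64.
print(df["count"].dtype)  # float64

# reindex() to add a column missing from every record behaves the same:
df = df.reindex(columns=["id", "count", "extra"])
print(df["extra"].dtype)  # float64, filled with NaN
{code}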
NaNs are a problem to begin with, because if you convert them to Parquet you
end up with zeros instead of nulls.
Running pandas.DataFrame.reindex(fill_value=None) doesn't help, because
passing in None is equivalent to calling pandas.DataFrame.reindex() without
any params.
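A quick sketch of that behavior:

{code:python}
import pandas as pd

df = pd.DataFrame({"count": [1, 2]})

# fill_value=None behaves like the default, so the new row is NaN,
# not None, and the column still flips to float64.
out = df.reindex(index=[0, 1, 2], fill_value=None)
print(out["count"].tolist())  # [1.0, 2.0, nan]
{code}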
The only way to replace NaNs with None is with pandas.DataFrame.where().
After replacing the NaNs you can change the column's dtype from FLOAT64 back
to INT64.
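A sketch of the whole dance; the int() round-trip stands in for the dtype
change, since the object column needs Python ints for Arrow to read it as
INT64:

{code:python}
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"count": [1.0, None]})  # NaN forced the column to float64

# where() keeps values where the condition holds and substitutes None
# elsewhere, turning the column into dtype object.
clean = df.where(df.notnull(), None)

# Restore the surviving values to Python ints (the "back to INT64" step).
clean["count"] = clean["count"].map(lambda v: None if v is None else int(v))

# Arrow now converts the object column to int64 with a real null.
schema = pa.schema([pa.field("count", pa.int64())])
table = pa.Table.from_pandas(clean, schema=schema, preserve_index=False)
{code}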
It's basically a lot of hoops to jump through just to preserve your original
JSON INT as a Parquet INT.
Maybe the best solution is to create a pyarrow.Table.from_pydict() function
that builds an Arrow table directly from a Python dictionary. Right now we
have pyarrow.Table.to_pydict(), pyarrow.Table.to_pandas() and
pyarrow.Table.from_pandas(), but no way back from a plain dictionary.
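Something like the sketch below could sit behind it. pa.array() and
pa.Table.from_arrays() already exist; the from_pydict() name itself is the
proposal:

{code:python}
import pyarrow as pa

data = {"id": ["a", "b"], "count": [1, None]}

# What a from_pydict() could do internally: build one array per column,
# never routing through pandas, so ints stay ints and None stays null.
arrays = [pa.array(data["id"], type=pa.string()),
          pa.array(data["count"], type=pa.int64())]
table = pa.Table.from_arrays(arrays, names=["id", "count"])
{code}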
> [Python] from_pandas errors when schemas are used with lower resolution
> timestamps
> ----------------------------------------------------------------------------------
>
> Key: ARROW-3907
> URL: https://issues.apache.org/jira/browse/ARROW-3907
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.11.1
> Reporter: David Lee
> Priority: Major
> Fix For: 0.11.1
>
>
> When passing in a schema object to from_pandas a resolution error occurs if
> the schema uses a lower resolution timestamp. Do we need to also add
> "coerce_timestamps" and "allow_truncated_timestamps" parameters found in
> write_table() to from_pandas()?
> Error:
> pyarrow.lib.ArrowInvalid: ('Casting from timestamp[ns] to timestamp[ms] would
> lose data: 1532015191753713000', 'Conversion failed for column modified with
> type datetime64[ns]')
> Code:
>
> {code:python}
> processed_schema = pa.schema([
>     pa.field('Id', pa.string()),
>     pa.field('modified', pa.timestamp('ms')),
>     pa.field('records', pa.int32())
> ])
>
> pa.Table.from_pandas(df, schema=processed_schema, preserve_index=False)
> {code}
>
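For reference, write_table() already exposes the two parameters mentioned in
the description, so the truncation can be requested explicitly at write time.
A minimal sketch (the sample DataFrame is made up for illustration):

{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"modified": pd.to_datetime(["2018-07-19 15:06:31.753713"])})

# Convert with the nanosecond timestamps intact, then coerce when writing.
table = pa.Table.from_pandas(df, preserve_index=False)
pq.write_table(table, "out.parquet",
               coerce_timestamps="ms",
               allow_truncated_timestamps=True)
{code}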
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)