cheolgook opened a new issue, #37509:
URL: https://github.com/apache/arrow/issues/37509
### Describe the bug, including details regarding any error messages, version, and platform.
I have a Python 3 script that reads JSON log files and stores them as
Parquet using a pyarrow schema. I recently ran into a strange case that fails
with the error in the title,
`ArrowTypeError: "object of type <class 'str'> cannot be converted to int"`.
Here's the log file:
```json, pa_bad.log
{"uuid":"D4D8806A-4794-11EE-BC2C-31A44E74A5E9","timestamp":"2023-08-30T00:21:12+0000","version":"0.1","event_data":{"amount":27.1048,"user":19323160,"timestamp":"2023-08-30T00:21:12+0000"}}
```
Here's my script:
```python3, pa.py
#!/usr/bin/python3

import pandas as pd
import pyarrow as pa

SCHEMA = pa.schema([
    ('uuid', pa.string()),
    ('timestamp', pa.timestamp('s', tz='UTC')),
    ('version', pa.float64()),
    ('event_data', pa.struct([
        ('amount', pa.float64()),
        ('user', pa.int64()),
        ('timestamp', pa.timestamp('s', tz='UTC')),
    ])),
])


def main():
    df = pd.read_json('pa_bad.log', lines=True)
    df.to_parquet(path='pa.parquet', schema=SCHEMA)


if __name__ == '__main__':
    main()
```
and the error looks like this:
```shell
$ python3 pa.py
Traceback (most recent call last):
  File "pa.py", line 24, in <module>
    main()
  File "pa.py", line 20, in main
    df.to_parquet(path='pa.parquet', schema=SCHEMA)
  File "/home/user/.pyenv/versions/3.7.1/lib/python3.7/site-packages/pandas/util/_decorators.py", line 199, in wrapper
    return func(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.7.1/lib/python3.7/site-packages/pandas/core/frame.py", line 2463, in to_parquet
    **kwargs,
  File "/home/user/.pyenv/versions/3.7.1/lib/python3.7/site-packages/pandas/io/parquet.py", line 397, in to_parquet
    **kwargs,
  File "/home/user/.pyenv/versions/3.7.1/lib/python3.7/site-packages/pandas/io/parquet.py", line 152, in write
    table = self.api.Table.from_pandas(df, **from_pandas_kwargs)
  File "pyarrow/table.pxi", line 3681, in pyarrow.lib.Table.from_pandas
  File "/home/user/.pyenv/versions/3.7.1/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 612, in dataframe_to_arrays
    for c, f in zip(columns_to_convert, convert_fields)]
  File "/home/user/.pyenv/versions/3.7.1/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 612, in <listcomp>
    for c, f in zip(columns_to_convert, convert_fields)]
  File "/home/user/.pyenv/versions/3.7.1/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 598, in convert_column
    raise e
  File "/home/user/.pyenv/versions/3.7.1/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 592, in convert_column
    result = pa.array(col, type=type_, from_pandas=True, safe=safe)
  File "pyarrow/array.pxi", line 323, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 123, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: ("object of type <class 'str'> cannot be converted to int", 'Conversion failed for column event_data with type object')
```
While digging into the JSON file, I found that it works if I remove
`event_data.timestamp` from the JSON log file:
```json, pa_good.log
{"uuid":"D4D8806A-4794-11EE-BC2C-31A44E74A5E9","timestamp":"2023-08-30T00:21:12+0000","version":"0.1","event_data":{"amount":27.1048,"user":19323160}}
```
Please note that the schema entry for `event_data.timestamp` in `pa.py`
was removed as well for this run.
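For anyone reproducing this, it may help to inspect what `pd.read_json` actually hands to pyarrow for the top-level and the nested `timestamp`. The sketch below is only a diagnostic aid. It assumes the difference comes from pandas' default `convert_dates` handling, which, as far as the pandas documentation describes it, looks at top-level column labels such as `timestamp` and does not descend into nested objects. The inlined record and the variable names are illustrative, not part of the original script.

```python
import io

import pandas as pd

# Same record as pa_bad.log, inlined so the snippet is self-contained.
LINE = (
    '{"uuid":"D4D8806A-4794-11EE-BC2C-31A44E74A5E9",'
    '"timestamp":"2023-08-30T00:21:12+0000","version":"0.1",'
    '"event_data":{"amount":27.1048,"user":19323160,'
    '"timestamp":"2023-08-30T00:21:12+0000"}}'
)

df = pd.read_json(io.StringIO(LINE), lines=True)

# Top-level column: expected to have been parsed into a datetime dtype.
print(df["timestamp"].dtype)

# Nested value: the cell is a plain dict and its timestamp is still a str.
nested = df.loc[0, "event_data"]
print(type(nested).__name__, type(nested["timestamp"]).__name__)
```

If the nested value prints as `str`, that would be consistent with the `<class 'str'> cannot be converted to int` message in the traceback.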
I tried these variations of the JSON input and they all worked with no problem.
```json
{
"uuid": "D4D8806A-4794-11EE-BC2C-31A44E74A5E9",
"timestamp": "2023-08-30T00:21:12+0000",
"version": "0.1",
"event_data": {
"amount": 27.1048,
"user": 19323160,
"uuid": "D4D8806A-4794-11EE-BC2C-31A44E74A5E9" <-------- works
}
}
```
```json
{
"uuid": "D4D8806A-4794-11EE-BC2C-31A44E74A5E9",
"timestamp": "2023-08-30T00:21:12+0000",
"version": "0.1",
"event_data": {
"amount": 27.1048,
"user": 19323160,
"version": "0.1" <-------- works
}
}
```
The problem happens only when there is a timestamp value in a nested object:
```json
{
"uuid": "D4D8806A-4794-11EE-BC2C-31A44E74A5E9",
"version": "0.1",
"event": {
"amount": 27.1048,
"user": 19323160,
"ts": "2023-08-30T00:21:12+0000" <----------- DOES NOT WORK
}
}
```
Tested on these systems:
```shell
# macOS
$ pyenv local
3.11.1
$ pip show pandas pyarrow
Name: pandas
Version: 1.5.1
Summary: Powerful data structures for data analysis, time series, and statistics
Home-page: https://pandas.pydata.org
Author: The Pandas Development Team
Author-email: [email protected]
License: BSD-3-Clause
Location: /Users/user/.pyenv/versions/3.11.1/lib/python3.11/site-packages
Requires: numpy, python-dateutil, pytz
Required-by: parquet-tools
---
Name: pyarrow
Version: 13.0.0
Summary: Python library for Apache Arrow
Home-page: https://arrow.apache.org/
Author:
Author-email:
License: Apache License, Version 2.0
Location: /Users/user/.pyenv/versions/3.11.1/lib/python3.11/site-packages
Requires: numpy
Required-by: parquet-tools
```
```shell
# CentOS 7
$ pyenv local
3.7.1
$ pip show pandas pyarrow
Name: pandas
Version: 1.2.0
Summary: Powerful data structures for data analysis, time series, and statistics
Home-page: https://pandas.pydata.org
Author: None
Author-email: None
License: BSD
Location: /home/user/.pyenv/versions/3.7.1/lib/python3.7/site-packages
Requires: python-dateutil, pytz, numpy
Required-by:
---
Name: pyarrow
Version: 12.0.1
Summary: Python library for Apache Arrow
Home-page: https://arrow.apache.org/
Author: None
Author-email: None
License: Apache License, Version 2.0
Location: /home/user/.pyenv/versions/3.7.1/lib/python3.7/site-packages
Requires: numpy
Required-by:
```
### Component(s)
Parquet, Python