[
https://issues.apache.org/jira/browse/ARROW-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17519565#comment-17519565
]
Jacob Wujciak-Jens commented on ARROW-13314:
--------------------------------------------
The same exception still happens in pyarrow 7.0.0.
> [Python] JSON parsing segment fault on long records (block_size) dependent
> --------------------------------------------------------------------------
>
> Key: ARROW-13314
> URL: https://issues.apache.org/jira/browse/ARROW-13314
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Reporter: Guido Muscioni
> Priority: Major
>
> Hello,
>
> I have a large JSON file (~300 MB) with complex records (nested JSON, nested
> lists of JSON objects). When I try to read it with pyarrow I get a
> segmentation fault. I then tried a couple of read options; please see the
> code below (I developed it against the example file attached to
> https://issues.apache.org/jira/browse/ARROW-9612):
>
> {code:python}
> from pyarrow import json
> from pyarrow.json import ReadOptions
> import tqdm
>
> if __name__ == '__main__':
>     source = 'wiki_04.jsonl'
>     ro = ReadOptions(block_size=2**20)
>     with open(source, 'r') as file:
>         for i, line in tqdm.tqdm(enumerate(file)):
>             with open('temp_file_arrow_3.ndjson', 'a') as file2:
>                 file2.write(line)
>             json.read_json('temp_file_arrow_3.ndjson', read_options=ro)
> {code}
> For both the example file and my file, this code raises the straddling-object
> exception (or segfaults) once the accumulated file reaches the block_size.
> Increasing block_size only makes the failure happen later.
> I then tried supplying an explicit schema for my file:
> {code:python}
> import pandas as pd
> import pyarrow as pa
> from pyarrow import json
> from pyarrow.json import ParseOptions
>
> if __name__ == '__main__':
>     source = 'my_file.jsonl'
>     df = pd.read_json(source, lines=True)
>     table_schema = pa.Table.from_pandas(df).schema
>
>     po = ParseOptions(explicit_schema=table_schema)
>     table = json.read_json(source, parse_options=po)
> {code}
> This works, which suggests that this issue, and the one in the linked JIRA
> ticket, only appear when no explicit schema is provided.
> The following code also works:
> {code:python}
> from pyarrow import json
> from pyarrow.json import ReadOptions
>
> if __name__ == '__main__':
>     source = 'my_file.jsonl'
>     ro = ReadOptions(block_size=2**30)
>     table = json.read_json(source, read_options=ro)
> {code}
> In this case the block_size is larger than my file. Is it possible that the
> schema is inferred from the first block, and that a segfault occurs when the
> schema changes in a later block?
> I cannot share my JSON file; however, I hope someone can shed some light on
> what I am seeing and perhaps suggest a workaround.
> Thank you,
> Guido
--
This message was sent by Atlassian Jira
(v8.20.1#820001)