Ryan Stalets created ARROW-13318:
------------------------------------
Summary: kMaxParserNumRows Value Increase/Removal
Key: ARROW-13318
URL: https://issues.apache.org/jira/browse/ARROW-13318
Project: Apache Arrow
Issue Type: Improvement
Components: C++, Python
Reporter: Ryan Stalets
I'm a new PyArrow user and have been investigating occasional errors surfaced
as the Python exception "ArrowInvalid: Exceeded maximum rows" when parsing
JSON Lines files with pyarrow.json.read_json(). Digging in, the original
source of this error appears to be cpp/src/arrow/json/parser.cc at line 703,
which returns an invalid Status when the number of rows parsed exceeds
kMaxParserNumRows.
{code:java}
  for (; num_rows_ < kMaxParserNumRows; ++num_rows_) {
    auto ok = reader.Parse<parse_flags>(json, handler);
    switch (ok.Code()) {
      case rj::kParseErrorNone:
        // parse the next object
        continue;
      case rj::kParseErrorDocumentEmpty:
        // parsed all objects, finish
        return Status::OK();
      case rj::kParseErrorTermination:
        // handler emitted an error
        return handler.Error();
      default:
        // rj emitted an error
        return ParseError(rj::GetParseError_En(ok.Code()), " in row ", num_rows_);
    }
  }
  return Status::Invalid("Exceeded maximum rows");
}{code}
This constant appears to be set in arrow/json/parser.h on line 53, and has been
set this way since that file's initial commit.
{code:java}
constexpr int32_t kMaxParserNumRows = 100000;{code}
There does not appear to be a comment in the code, nor an explanation in the
original commit or PR, justifying this maximum row count.
I'm wondering what the reason for this maximum might be, and whether it could
be removed, increased, or made user-overridable in the C++ implementation and
exposed through the Python bindings. It is common to need to process JSON
files of arbitrary length (application logs, third-party vendor exports, etc.)
where the consumer of the data has no control over the size of the file.
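Not a fix, but a sketch of a possible workaround in the meantime, assuming the
limit applies per parsed block (as the loop above suggests): shrinking
ReadOptions.block_size should keep each block's row count well under the
limit, at some cost in per-block overhead. The row count and block size below
are arbitrary values for illustration.

{code:python}
import os
import tempfile

import pyarrow.json as paj

# Write 200,000 small JSON lines to a temporary file.
fd, path = tempfile.mkstemp(suffix=".jsonl")
with os.fdopen(fd, "w") as f:
    for i in range(200_000):
        f.write('{"x": %d}\n' % i)

# Keep blocks small (64 KiB here) so each parsed block holds far
# fewer rows than kMaxParserNumRows.
opts = paj.ReadOptions(block_size=64 * 1024)
table = paj.read_json(path, read_options=opts)
print(table.num_rows)

os.remove(path)
{code}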
--
This message was sent by Atlassian Jira
(v8.3.4#803005)