Joseph Gardi created ARROW-17469:
------------------------------------
Summary: Failure to parse files that can be parsed on pyarrow.
Also, failure to recover from crash
Key: ARROW-17469
URL: https://issues.apache.org/jira/browse/ARROW-17469
Project: Apache Arrow
Issue Type: Bug
Environment: Mac OS 11.4
go 1.17.1
github.com/apache/arrow/go/arrow v0.0.0-20211112161151-bc219186db40
Reporter: Joseph Gardi
Attachments:
part-00000-db343798-fcc1-4288-be39-3b00bed75c24.c000.snappy.parquet
I am using the following code to read parquet files in go and it works on some
parquet files:
{code:java}
import (
    "bytes"
    "fmt"

    "github.com/apache/arrow/go/v10/arrow/memory"
    "github.com/apache/arrow/go/v10/parquet/file"
    "github.com/apache/arrow/go/v10/parquet/pqarrow"
)
...
// data, ctx, and check are defined elsewhere in the program.
pf, err := file.NewParquetReader(bytes.NewReader(data))
check(err)
preader, err := pqarrow.NewFileReader(pf, pqarrow.ArrowReadProperties{},
    memory.DefaultAllocator)
check(err)
fmt.Println("before read table")
result, err := preader.ReadTable(ctx)
check(err)
fmt.Println("result is", result.NumRows())
result.Release(){code}
It works on some parquet files but not on other files that can be parsed by
pyarrow's read_table function. However, even pyarrow fails to parse some
parquet files that I was able to parse with
[https://github.com/xitongsys/parquet-go]. I've attached an example of a file
that fails. When it fails I get this stack trace:
{code:java}
panic: runtime error: index out of range [0] with length 0
goroutine 595 [running]:
github.com/apache/arrow/go/v10/parquet/internal/utils.NewFirstTimeBitmapWriter(...)
/Users/josephgardi/.gvm/pkgsets/go1.17.1/global/pkg/mod/github.com/apache/arrow/go/[email protected]/parquet/internal/utils/bitmap_writer.go:83
github.com/apache/arrow/go/v10/parquet/file.defLevelsToBitmapInternal({0xc001714500,
0x0, 0x1000}, {0x881b680, 0x0, 0x100, 0x2648}, 0xc001583880, 0x40)
/Users/josephgardi/.gvm/pkgsets/go1.17.1/global/pkg/mod/github.com/apache/arrow/go/[email protected]/parquet/file/level_conversion.go:173
+0x23b
github.com/apache/arrow/go/v10/parquet/file.DefLevelsToBitmap(...)
/Users/josephgardi/.gvm/pkgsets/go1.17.1/global/pkg/mod/github.com/apache/arrow/go/[email protected]/parquet/file/level_conversion.go:186
github.com/apache/arrow/go/v10/parquet/pqarrow.(*structReader).BuildArray(0xc00049dec0,
0x0)
/Users/josephgardi/.gvm/pkgsets/go1.17.1/global/pkg/mod/github.com/apache/arrow/go/[email protected]/parquet/pqarrow/column_readers.go:279
+0x1b3
github.com/apache/arrow/go/v10/parquet/pqarrow.(*listReader).BuildArray(0xc000a31ac0,
0xbd3)
/Users/josephgardi/.gvm/pkgsets/go1.17.1/global/pkg/mod/github.com/apache/arrow/go/[email protected]/parquet/pqarrow/column_readers.go:391
+0x4a2
github.com/apache/arrow/go/v10/parquet/pqarrow.(*structReader).BuildArray(0xc000418f60,
0xbd3)
/Users/josephgardi/.gvm/pkgsets/go1.17.1/global/pkg/mod/github.com/apache/arrow/go/[email protected]/parquet/pqarrow/column_readers.go:289
+0x534
github.com/apache/arrow/go/v10/parquet/pqarrow.(*ColumnReader).NextBatch(0xc00051c330,
0x0)
/Users/josephgardi/.gvm/pkgsets/go1.17.1/global/pkg/mod/github.com/apache/arrow/go/[email protected]/parquet/pqarrow/file_reader.go:134
+0x5c
github.com/apache/arrow/go/v10/parquet/pqarrow.(*FileReader).ReadColumn(0xc001583f88,
{0xc0008f72b0, 0xc000bda3f0, 0x0}, 0xc000a316c0)
/Users/josephgardi/.gvm/pkgsets/go1.17.1/global/pkg/mod/github.com/apache/arrow/go/[email protected]/parquet/pqarrow/file_reader.go:247
+0x65
github.com/apache/arrow/go/v10/parquet/pqarrow.(*FileReader).ReadRowGroups.func1()
/Users/josephgardi/.gvm/pkgsets/go1.17.1/global/pkg/mod/github.com/apache/arrow/go/[email protected]/parquet/pqarrow/file_reader.go:341
+0xd2
created by
github.com/apache/arrow/go/v10/parquet/pqarrow.(*FileReader).ReadRowGroups
/Users/josephgardi/.gvm/pkgsets/go1.17.1/global/pkg/mod/github.com/apache/arrow/go/[email protected]/parquet/pqarrow/file_reader.go:332
+0x3d2{code}
There is always some chance that my application will encounter a bad parquet
file, so I'd like to be able to recover from this panic. However, that doesn't
work easily because the panic occurs in a different goroutine, which
is created on line 332 of file_reader.go in ReadRowGroups.
So it seems that the solution is to do a recover within that goroutine and then
try a different parser such as
[xitongsys/parquet-go|https://github.com/xitongsys/parquet-go].
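
To illustrate what I mean, here is a minimal sketch using only the standard library. It is not the Arrow code: runRecovered and firstByte are hypothetical names I made up, and firstByte just stands in for the out-of-range index in the trace above. The point is that recover has to be deferred inside the spawned goroutine, which then hands the failure back to the caller as an ordinary error.

```go
package main

import "fmt"

// runRecovered is a hypothetical helper, not part of the Arrow API.
// It runs work in a new goroutine and turns a panic raised there
// into an error returned to the caller.
func runRecovered(work func() error) error {
	errc := make(chan error, 1) // buffered so the goroutine never blocks
	go func() {
		// recover only has an effect in the goroutine that panicked,
		// so the deferred call must live here and not in the caller.
		defer func() {
			if r := recover(); r != nil {
				errc <- fmt.Errorf("recovered from panic: %v", r)
			}
		}()
		errc <- work()
	}()
	return <-errc
}

// firstByte is a hypothetical stand-in for the failing code path:
// it panics with an index-out-of-range error on an empty slice.
func firstByte(b []byte) byte {
	return b[0]
}

func main() {
	err := runRecovered(func() error {
		_ = firstByte(nil) // panics inside the worker goroutine
		return nil
	})
	if err != nil {
		// At this point the application could fall back to another parser.
		fmt.Println("read failed:", err)
	}
}
```

With something like this inside ReadRowGroups, ReadTable could return an error instead of crashing the whole process, and the caller could then decide whether to retry with another library.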