[ 
https://issues.apache.org/jira/browse/ARROW-17469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Gardi updated ARROW-17469:
---------------------------------
    Description: 
I am using the following code to read Parquet files in Go, and it works on some 
Parquet files:
{code:java}
import (
    "bytes"
    "fmt"

    "github.com/apache/arrow/go/v10/arrow/memory"
    "github.com/apache/arrow/go/v10/parquet/file"
    "github.com/apache/arrow/go/v10/parquet/pqarrow"
)
...

// data, ctx and check are defined elsewhere in the program (not shown).
pf, err := file.NewParquetReader(bytes.NewReader(data))
check(err)
preader, err := pqarrow.NewFileReader(pf, pqarrow.ArrowReadProperties{}, memory.DefaultAllocator)
check(err)
fmt.Println("before read table")
// The panic below is raised during this call.
result, err := preader.ReadTable(ctx)
check(err)
fmt.Println("result is", result.NumRows())
result.Release()
{code}
It works on some parquet files but not on other files that can be parsed by 
pyarrow's read_table function. However, even pyarrow fails to parse some 
parquet files that I was able to parse with 
[https://github.com/xitongsys/parquet-go]. I've attached an example of a file 
that fails. When it fails I get this stack trace:

{code:java}
panic: runtime error: index out of range [0] with length 0

goroutine 595 [running]:
github.com/apache/arrow/go/v10/parquet/internal/utils.NewFirstTimeBitmapWriter(...)
    /Users/josephgardi/.gvm/pkgsets/go1.17.1/global/pkg/mod/github.com/apache/arrow/go/[email protected]/parquet/internal/utils/bitmap_writer.go:83
github.com/apache/arrow/go/v10/parquet/file.defLevelsToBitmapInternal({0xc001714500, 0x0, 0x1000}, {0x881b680, 0x0, 0x100, 0x2648}, 0xc001583880, 0x40)
    /Users/josephgardi/.gvm/pkgsets/go1.17.1/global/pkg/mod/github.com/apache/arrow/go/[email protected]/parquet/file/level_conversion.go:173 +0x23b
github.com/apache/arrow/go/v10/parquet/file.DefLevelsToBitmap(...)
    /Users/josephgardi/.gvm/pkgsets/go1.17.1/global/pkg/mod/github.com/apache/arrow/go/[email protected]/parquet/file/level_conversion.go:186
github.com/apache/arrow/go/v10/parquet/pqarrow.(*structReader).BuildArray(0xc00049dec0, 0x0)
    /Users/josephgardi/.gvm/pkgsets/go1.17.1/global/pkg/mod/github.com/apache/arrow/go/[email protected]/parquet/pqarrow/column_readers.go:279 +0x1b3
github.com/apache/arrow/go/v10/parquet/pqarrow.(*listReader).BuildArray(0xc000a31ac0, 0xbd3)
    /Users/josephgardi/.gvm/pkgsets/go1.17.1/global/pkg/mod/github.com/apache/arrow/go/[email protected]/parquet/pqarrow/column_readers.go:391 +0x4a2
github.com/apache/arrow/go/v10/parquet/pqarrow.(*structReader).BuildArray(0xc000418f60, 0xbd3)
    /Users/josephgardi/.gvm/pkgsets/go1.17.1/global/pkg/mod/github.com/apache/arrow/go/[email protected]/parquet/pqarrow/column_readers.go:289 +0x534
github.com/apache/arrow/go/v10/parquet/pqarrow.(*ColumnReader).NextBatch(0xc00051c330, 0x0)
    /Users/josephgardi/.gvm/pkgsets/go1.17.1/global/pkg/mod/github.com/apache/arrow/go/[email protected]/parquet/pqarrow/file_reader.go:134 +0x5c
github.com/apache/arrow/go/v10/parquet/pqarrow.(*FileReader).ReadColumn(0xc001583f88, {0xc0008f72b0, 0xc000bda3f0, 0x0}, 0xc000a316c0)
    /Users/josephgardi/.gvm/pkgsets/go1.17.1/global/pkg/mod/github.com/apache/arrow/go/[email protected]/parquet/pqarrow/file_reader.go:247 +0x65
github.com/apache/arrow/go/v10/parquet/pqarrow.(*FileReader).ReadRowGroups.func1()
    /Users/josephgardi/.gvm/pkgsets/go1.17.1/global/pkg/mod/github.com/apache/arrow/go/[email protected]/parquet/pqarrow/file_reader.go:341 +0xd2
created by github.com/apache/arrow/go/v10/parquet/pqarrow.(*FileReader).ReadRowGroups
    /Users/josephgardi/.gvm/pkgsets/go1.17.1/global/pkg/mod/github.com/apache/arrow/go/[email protected]/parquet/pqarrow/file_reader.go:332 +0x3d2
{code}
 

There is always some chance that my application will encounter a bad parquet 
file, so I'd like to be able to recover from this panic. However, a recover in 
my own code cannot catch it, because the panic is raised in a different 
goroutine, which is created on line 332 of file_reader.go in ReadRowGroups. 

So it seems that the solution is to do a recover within that goroutine and then 
try a different parser such as 
[xitongsys/parquet-go|https://github.com/xitongsys/parquet-go].
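
To illustrate the idea, here is a minimal sketch of recovering inside the goroutine and handing the panic back to the caller as an ordinary error. It is not the Arrow code: readColumn and recoverToError are hypothetical names made up for this example, and readColumn merely imitates the "index out of range" failure. A deferred recover only sees panics raised on its own goroutine, which is why the recover has to sit inside the worker and, in the real case, inside the library's ReadRowGroups goroutine.

```go
package main

import "fmt"

// readColumn is a hypothetical stand-in for per-column work that may panic
// on a malformed file. It is NOT part of the Arrow API.
func readColumn(levels []int16) (int16, error) {
	// Indexing an empty slice panics with "index out of range".
	return levels[0], nil
}

// recoverToError (hypothetical helper) runs fn in a new goroutine and turns a
// panic raised there into an error that is delivered to the caller.
func recoverToError(fn func() error) error {
	errc := make(chan error, 1) // buffered so the worker never blocks on send
	go func() {
		defer func() {
			// recover must be called on the goroutine that panicked.
			if r := recover(); r != nil {
				errc <- fmt.Errorf("recovered panic in worker: %v", r)
			}
		}()
		errc <- fn()
	}()
	return <-errc
}

func main() {
	err := recoverToError(func() error {
		_, err := readColumn(nil) // empty input triggers the panic
		return err
	})
	if err != nil {
		// At this point the application could try a different parser.
		fmt.Println("read failed, falling back:", err)
	}
}
```

With this pattern the caller gets an error value it can check before trying another parser. Since application code cannot add a recover to goroutines started inside the library, the equivalent change would have to be made in the library itself.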

 

On an unrelated note, I'm wondering why the Go implementation of read_table is 
so much faster than the Python one if they are both calling C++.


> Failure to parse files that can be parsed on pyarrow. Also, failure to 
> recover from crash
> -----------------------------------------------------------------------------------------
>
>                 Key: ARROW-17469
>                 URL: https://issues.apache.org/jira/browse/ARROW-17469
>             Project: Apache Arrow
>          Issue Type: Bug
>         Environment: Mac OS 11.4
> go 1.17.1
> github.com/apache/arrow/go/arrow v0.0.0-20211112161151-bc219186db40
>            Reporter: Joseph Gardi
>            Priority: Major
>         Attachments: 
> part-00000-db343798-fcc1-4288-be39-3b00bed75c24.c000.snappy.parquet
>
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
