[
https://issues.apache.org/jira/browse/ARROW-15544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507723#comment-17507723
]
Matthew Topol commented on ARROW-15544:
---------------------------------------
Hey [~antoinegelloz] and welcome!
If I'm remembering correctly from when I implemented it, I used
`RawStdEncoding` because that was how the schema was encoded in a few of the
test Parquet files. That being said, it's a reasonable change to improve the
pqarrow library so that it falls back to `StdEncoding` if decoding with
`RawStdEncoding` fails.
I do have a question: can Pandas read a Parquet file written by pqarrow using
`RawStdEncoding`, or does Pandas fail because it expects the padding
characters?
Looking at the current C++ code, I think the existing base64 encoding code
does indeed pad its output, so this might just be an oversight on my end:
writing should be done with `StdEncoding`, and reading should try both. When I
get a chance I'll try putting a PR up with a fix.
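To make the "reading should try both" idea concrete, here is a minimal sketch
using only Go's standard `encoding/base64` package. The helper name
`decodeSchemaValue` is hypothetical and this is not the actual pqarrow code or
the eventual fix; it only illustrates trying `RawStdEncoding` first and falling
back to `StdEncoding` when the value carries padding.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// decodeSchemaValue is a hypothetical helper (not part of pqarrow).
// It first tries the unpadded encoding, then falls back to the padded one.
func decodeSchemaValue(s string) ([]byte, error) {
	if b, err := base64.RawStdEncoding.DecodeString(s); err == nil {
		return b, nil
	}
	// RawStdEncoding rejects '=' padding, so retry with StdEncoding.
	return base64.StdEncoding.DecodeString(s)
}

func main() {
	payload := []byte("arrow")

	// Same bytes encoded with and without padding characters.
	padded := base64.StdEncoding.EncodeToString(payload)
	raw := base64.RawStdEncoding.EncodeToString(payload)

	for _, in := range []string{padded, raw} {
		out, err := decodeSchemaValue(in)
		fmt.Printf("%q -> %q, err=%v\n", in, out, err)
	}
}
```

On the writing side, the same reasoning suggests encoding with
`base64.StdEncoding` so that readers expecting padding can decode the value.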
> [Go][Parquet] pqarrow.getOriginSchema error while decoding ARROW:schema
> -----------------------------------------------------------------------
>
> Key: ARROW-15544
> URL: https://issues.apache.org/jira/browse/ARROW-15544
> Project: Apache Arrow
> Issue Type: Bug
> Components: Go, Parquet
> Affects Versions: 7.0.0
> Environment: go1.17, python3.8
> Reporter: Antoine Gelloz
> Priority: Minor
>
> Hello !
> This is my first time participating in the open source community as a junior
> developer and I would like to thank you all for your hard work :)
> While using the new pqarrow package for our project
> [Metronlab/bow|https://github.com/Metronlab/bow] to read Parquet files
> previously written by Pandas, an error is returned by the function
> getOriginSchema if the "ARROW:schema" base64-encoded value ends with padding
> characters.
> This is caused by the use of the
> [RawStdEncoding|https://pkg.go.dev/encoding/base64#pkg-variables] type that
> omits padding characters.
> Is there any reason for using raw encoding instead of standard?
> Here is a repo with a test script to demonstrate the problem:
> [antoinegelloz/arrowparquet|https://github.com/antoinegelloz/arrowparquet]
> Thank you in advance for your help,
> Antoine Gelloz