Re: Fix for bug in parquet stream writer
Hello,

I contributed the parquet StreamWriter class and I can tell you that the type checking is not legacy logic, so removing it is not the solution. As you have suggested, the solution is to add an output operator to support the type, and this can be done by deriving from the StreamWriter class.

Ideally though, all basic C++ and parquet types should be supported; users should only have to add output operators for custom types. So I think a PR to add missing support for any parquet types would be nice.

Cheers,

Gawain

On Mon, 2020-12-28 at 17:34 +0100, anders johansson wrote:
> Hi,
>
> Not sure if my fix is a correct or permanent solution, depending on your
> plan regarding backwards compatibility, etc. Someone with a deeper
> understanding of how the code works or is supposed to work should probably
> have a look at it.
>
> A temporary workaround is to inherit from StreamWriter and define your own
> ostream overloads for logical types that are not supported. Example for
> date32:
>
> class StreamWriterEx : public StreamWriter {
>  public:
>   void WriteDate32Raw(int32_t d) {
>     CheckColumn(Type::INT32, ConvertedType::DATE);
>     Write(d);
>   }
> };
>
> To determine the expected ConvertedType, run the code once and look at the
> error message before inheriting, or just do without the CheckColumn() call.
>
> //A
>
> On Mon, Dec 28, 2020 at 4:55 PM Wes McKinney wrote:
>
> > hi Anders, would you like to open a Jira issue and submit a PR (with
> > unit test)?
> >
> > On Mon, Dec 28, 2020 at 9:51 AM anders johansson wrote:
> >
> > > Hi,
> > >
> > > When writing to a primitive node of a logical type not supported by
> > > converted_type (such as parquet::LogicalType::TimeUnit::NANOS), the error
> > > "Column converted type mismatch" is thrown. As I understand it, the
> > > converted_type logic is legacy. The problem is solved by removing
> > >
> > > if (converted_type != node->converted_type()) {
> > >   throw ParquetException("Column converted type mismatch.  Column '" +
> > >                          node->name() + "' has converted type[" +
> > >                          ConvertedTypeToString(node->converted_type()) +
> > >                          "] not '" +
> > >                          ConvertedTypeToString(converted_type) + "'");
> > > }
> > >
> > > from StreamWriter::CheckColumn() in src/parquet/stream_writer.cc
> > >
> > > BR,
> > > //Anders
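The derive-and-add-an-operator pattern recommended in the thread can be sketched without the real parquet library. The following is a minimal, self-contained stand-in (the class names `MiniWriter`, `MiniWriterEx`, `Date32` and the string encoding are illustrative assumptions, not the actual `parquet::StreamWriter` API):

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Stand-in for a stream-style writer: the base class only knows how to
// write one built-in type.
class MiniWriter {
 public:
  MiniWriter& operator<<(std::int64_t v) {
    out_ << "i64:" << v << ';';
    return *this;
  }
  std::string contents() const { return out_.str(); }

 protected:
  std::ostringstream out_;
};

// A hypothetical user type the base writer does not support.
struct Date32 {
  std::int32_t days_since_epoch;
};

// Derived writer adding an output operator for the unsupported type,
// mirroring the StreamWriterEx workaround from the thread.
class MiniWriterEx : public MiniWriter {
 public:
  MiniWriterEx& operator<<(Date32 d) {
    out_ << "date32:" << d.days_since_epoch << ';';
    return *this;
  }
  using MiniWriter::operator<<;  // keep the base operators visible
};
```

Usage follows the same chained form as the stream API, e.g. `MiniWriterEx w; w << Date32{18628} << std::int64_t{42};`. The `using`-declaration matters: without it, the derived `operator<<` would hide every base-class overload.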
[jira] [Created] (ARROW-7745) [Doc] [C++] Update Parquet documentation
Gawain BOLTON created ARROW-7745:

Summary: [Doc] [C++] Update Parquet documentation
Key: ARROW-7745
URL: https://issues.apache.org/jira/browse/ARROW-7745
Project: Apache Arrow
Issue Type: Improvement
Reporter: Gawain BOLTON
Assignee: Gawain BOLTON

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
Re: [VOTE] Release Apache Arrow 0.16.0 - RC1
Hello,

It would seem that the list of issues does not include any of the issues in the Parquet project which were fixed in this release.

Cheers,

Gawain

On 28/01/2020 11:46, Krisztián Szűcs wrote:
> Sorry, the previous email is hardly readable.
>
> I would like to propose the following release candidate (RC1) of
> Apache Arrow version 0.16.0. This is a release consisting of 710
> resolved JIRA issues[1].
>
> This release candidate is based on commit:
> 188afde1f4298fb668e8ebadeacbc545e2de086f [2]
>
> The source release rc1 is hosted at [3].
> The binary artifacts are hosted at [4][5][6][7].
> The changelog is located at [8].
>
> Please download, verify checksums and signatures, run the unit tests,
> and vote on the release. See [9] for how to validate a release
> candidate.
>
> The vote will be open for at least 72 hours.
>
> [ ] +1 Release this as Apache Arrow 0.16.0
> [ ] +0
> [ ] -1 Do not release this as Apache Arrow 0.16.0 because...
>
> [1]: https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20%28Resolved%2C%20Closed%29%20AND%20fixVersion%20%3D%200.16.0
> [2]: https://github.com/apache/arrow/tree/188afde1f4298fb668e8ebadeacbc545e2de086f
> [3]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-0.16.0-rc1
> [4]: https://bintray.com/apache/arrow/centos-rc/0.16.0-rc1
> [5]: https://bintray.com/apache/arrow/debian-rc/0.16.0-rc1
> [6]: https://bintray.com/apache/arrow/python-rc/0.16.0-rc1
> [7]: https://bintray.com/apache/arrow/ubuntu-rc/0.16.0-rc1
> [8]: https://github.com/apache/arrow/blob/188afde1f4298fb668e8ebadeacbc545e2de086f/CHANGELOG.md
> [9]: https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates
>
> On Tue, Jan 28, 2020 at 11:43 AM Krisztián Szűcs wrote:
> > Hi,
> >
> > I would like to propose the following release candidate (RC1) of
> > Apache Arrow version 0.16.0. [...]
[jira] [Created] (ARROW-7294) [Python] converted_type_name_from_enum(): Incorrect name for INT_64
Gawain BOLTON created ARROW-7294:

Summary: [Python] converted_type_name_from_enum(): Incorrect name for INT_64
Key: ARROW-7294
URL: https://issues.apache.org/jira/browse/ARROW-7294
Project: Apache Arrow
Issue Type: Bug
Reporter: Gawain BOLTON
Assignee: Gawain BOLTON

The INT_64 type is converted to "UINT_64". It should be converted to "INT_64".
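The report describes a classic copy-paste error in an enum-to-name mapping. The sketch below is a simplified stand-in, not the actual `converted_type_name_from_enum()` implementation; the enum subset and function name here are illustrative only, with the reported bug shown (and fixed) in a comment:

```cpp
#include <string>

enum class ConvertedType { UINT_64, INT_64 };  // simplified subset for illustration

// Maps a converted type to its display name.
std::string ConvertedTypeName(ConvertedType t) {
  switch (t) {
    case ConvertedType::UINT_64:
      return "UINT_64";
    case ConvertedType::INT_64:
      return "INT_64";  // the reported bug returned "UINT_64" here
  }
  return "UNKNOWN";
}
```

Adjacent cases returning near-identical strings make this kind of slip easy to miss in review and easy to pin down with a one-line unit test per enum value.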
Re: [C++][Parquet]: Stream API handling of optional fields
Thanks for your reply.

If I understand correctly, ARROW-7178 must be done so that Arrow has a version of std::optional which Parquet could then use. I think I will submit a PR for this shortly.

Gawain

On 15/11/2019 14:05, Francois Saint-Jacques wrote:
> I'm all for it. Created [1]; it would also enable an operator[] for
> arrays of primitive types [2].
>
> [1] https://issues.apache.org/jira/browse/ARROW-7178
> [2] https://issues.apache.org/jira/browse/ARROW-6276
>
> On Fri, Nov 15, 2019 at 12:40 AM Micah Kornfield wrote:
> > I think there are potentially other places in the Arrow code base
> > that "optional" could be useful (e.g. a row-reader like class for
> > Arrow Tables). It looks like there is at least one header-only
> > optional library [1] that is C++17 forward compatible. I think I
> > would lean towards vendoring that or another header-only library,
> > instead of depending on boost (I would need to double check, and
> > seem to recall there being differences between boost and the
> > standard one).
> >
> > [1] https://github.com/martinmoene/optional-lite
> >
> > On Thu, Nov 14, 2019 at 1:22 PM Gawain Bolton wrote:
> > > Hello,
> > >
> > > I would like to add support for handling optional fields to the
> > > parquet::StreamReader and parquet::StreamWriter classes which I
> > > recently contributed (thank you!).
> > >
> > > Ideally I would do this by using std::optional like this:
> > >
> > > parquet::StreamWriter writer{ parquet::ParquetFileWriter::Open(...) };
> > >
> > > std::optional<int32_t> d;
> > > writer << d;
> > > ...
> > > parquet::StreamReader reader{ parquet::ParquetFileReader::Open(...) };
> > > reader >> d;
> > >
> > > However std::optional is only available in C++17 and arrow is
> > > compiled in C++11 mode. From what I see, arrow does use Boost to a
> > > limited extent and in fact gandiva/cache.h uses the boost::optional
> > > class.
> > >
> > > So would it be possible to use the boost::optional class in parquet?
> > > Or perhaps someone can suggest another way of handling optional
> > > fields?
> > >
> > > Thanks in advance for your help,
> > >
> > > Gawain
[C++][Parquet]: Stream API handling of optional fields
Hello,

I would like to add support for handling optional fields to the parquet::StreamReader and parquet::StreamWriter classes which I recently contributed (thank you!).

Ideally I would do this by using std::optional like this:

parquet::StreamWriter writer{ parquet::ParquetFileWriter::Open(...) };

std::optional<int32_t> d;
writer << d;
...
parquet::StreamReader reader{ parquet::ParquetFileReader::Open(...) };
reader >> d;

However std::optional is only available in C++17 and arrow is compiled in C++11 mode. From what I see, arrow does use Boost to a limited extent and in fact gandiva/cache.h uses the boost::optional class.

So would it be possible to use the boost::optional class in parquet? Or perhaps someone can suggest another way of handling optional fields?

Thanks in advance for your help,

Gawain
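The semantics being asked for — an engaged optional writes its value, an empty optional records a null — can be sketched with a self-contained stand-in. This is not the real parquet API; `MiniStream` and the `"null;"` encoding are assumptions made purely for illustration:

```cpp
#include <cstdint>
#include <optional>
#include <sstream>
#include <string>

// Stand-in for a stream-style writer with optional-field support.
class MiniStream {
 public:
  MiniStream& operator<<(std::int32_t v) {
    out_ << v << ';';
    return *this;
  }
  // Optional support: forward the value if engaged, otherwise record a null.
  MiniStream& operator<<(const std::optional<std::int32_t>& v) {
    if (v) return *this << *v;
    out_ << "null;";
    return *this;
  }
  std::string contents() const { return out_.str(); }

 private:
  std::ostringstream out_;
};
```

For example, `MiniStream s; s << std::optional<std::int32_t>{7} << std::optional<std::int32_t>{};` writes one value and one null. The overload-on-optional design is what makes the `writer << d;` syntax from the message above work unchanged for both required and optional fields.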
[jira] [Created] (ARROW-6992) [C++]: Undefined Behavior sanitizer build option fails with GCC
Gawain BOLTON created ARROW-6992:

Summary: [C++]: Undefined Behavior sanitizer build option fails with GCC
Key: ARROW-6992
URL: https://issues.apache.org/jira/browse/ARROW-6992
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Gawain BOLTON

When building with the "undefined behaviour sanitizer" option (-DARROW_USE_UBSAN=ON), the compilation fails with GCC:

{noformat}
c++: error: unrecognized argument to ‘-fno-sanitize=’ option: ‘function’{noformat}

It appears that GCC has never had a "-fsanitize=function" option. I have fixed this issue and will submit a PR.