Re: Fix for bug in parquet stream writer

2020-12-29 Thread Gawain Bolton
Hello,

I contributed the parquet StreamWriter class and I can tell you that
the type checking is not legacy logic and removing it is not the
solution.

As you have suggested, the solution is to add an output operator that
supports the type, and this can be done by deriving from the
StreamWriter class.

Ideally, though, all basic C++ and parquet types should be supported.
Users should only have to add output operators for custom types.

So I think a PR to add missing support for any parquet types would be
nice.
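
To illustrate the pattern only, here is a minimal self-contained sketch.
It does not use the real parquet classes: MockStreamWriter, ColumnSpec,
PhysicalType, ConvertedKind and Date32 are all hypothetical stand-ins, and
the check is a simplified version of the one discussed in this thread. The
point is that the derived class adds an output operator which performs the
column type check for the new type before writing.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the physical and converted type enums.
enum class PhysicalType { INT32, INT64 };
enum class ConvertedKind { NONE, INT_32, DATE };

struct ColumnSpec {
  PhysicalType physical;
  ConvertedKind converted;
};

// Simplified stand-in for a stream writer: every write first checks that the
// value matches the type declared for the current column.
class MockStreamWriter {
 public:
  explicit MockStreamWriter(std::vector<ColumnSpec> schema)
      : schema_(std::move(schema)), column_(0) {}

  MockStreamWriter& operator<<(int32_t v) {
    CheckColumn(PhysicalType::INT32, ConvertedKind::INT_32);
    Append(v);
    return *this;
  }

  const std::vector<int64_t>& values() const { return values_; }

 protected:
  void CheckColumn(PhysicalType physical, ConvertedKind converted) const {
    if (column_ >= schema_.size())
      throw std::out_of_range("Column index out of bounds");
    if (schema_[column_].physical != physical)
      throw std::runtime_error("Column physical type mismatch");
    if (schema_[column_].converted != converted)
      throw std::runtime_error("Column converted type mismatch");
  }

  void Append(int64_t v) {
    values_.push_back(v);
    ++column_;
  }

 private:
  std::vector<ColumnSpec> schema_;
  std::size_t column_;
  std::vector<int64_t> values_;
};

// Custom value type (hypothetical): a date stored as days since 1970-01-01.
struct Date32 {
  int32_t days_since_epoch;
};

// The derived class keeps the type check and adds support for the new type.
class MockStreamWriterEx : public MockStreamWriter {
 public:
  using MockStreamWriter::MockStreamWriter;
  using MockStreamWriter::operator<<;  // keep the base overloads visible

  MockStreamWriterEx& operator<<(Date32 d) {
    CheckColumn(PhysicalType::INT32, ConvertedKind::DATE);
    Append(d.days_since_epoch);
    return *this;
  }
};

// Usage: one row with a DATE column followed by an INT_32 column.
inline std::vector<int64_t> WriteExampleRow() {
  std::vector<ColumnSpec> schema;
  schema.push_back(ColumnSpec{PhysicalType::INT32, ConvertedKind::DATE});
  schema.push_back(ColumnSpec{PhysicalType::INT32, ConvertedKind::INT_32});
  MockStreamWriterEx writer(schema);
  writer << Date32{18625};  // 2020-12-29 as days since the epoch
  writer << int32_t{7};
  assert(writer.values().size() == 2);
  return writer.values();
}
```

One limitation of this sketch: the base operators return a reference to the
base class, so after a base overload has been used in a chain the derived
overloads are no longer found. The example therefore writes each value in a
separate statement.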

Cheers,

Gawain

On Mon, 2020-12-28 at 17:34 +0100, anders johansson wrote:
> Hi,
> 
> Not sure if my fix is a correct or permanent solution, depending on your
> plan regarding backwards compatibility, etc. Someone with a deeper
> understanding of how the code works or is supposed to work should probably
> have a look at it.
> 
> A temporary workaround is to inherit from StreamWriter and define your own
> ostream overloads for logical types that are not supported. Example for
> date 32:
> 
> class StreamWriterEx : public StreamWriter {
>  public:
>   // Writes a date given as days since the Unix epoch.
>   void WriteDate32Raw(int32_t d) {
>     CheckColumn(Type::INT32, ConvertedType::DATE);
>     Write(d);
>   }
> };
> 
> 
> To determine the expected ConvertedType, run the code once and look at the
> error message before inheriting, or just do without the CheckColumn() call.
> 
> //A
> 
> 
> On Mon, Dec 28, 2020 at 4:55 PM Wes McKinney wrote:
> 
> > hi Anders, would you like to open a Jira issue and submit a PR (with
> > unit test)?
> > 
> > On Mon, Dec 28, 2020 at 9:51 AM anders johansson wrote:
> > > 
> > > Hi,
> > > 
> > > When writing to a primitive node of a logical type not supported by
> > > converted_type (such as parquet::LogicalType::TimeUnit::NANOS), the error
> > > "Column converted type mismatch" is thrown. As I understand it, the
> > > converted_type logic is legacy. The problem is solved by removing
> > > 
> > >   if (converted_type != node->converted_type()) {
> > >     throw ParquetException("Column converted type mismatch.  Column '" +
> > >                            node->name() + "' has converted type[" +
> > >                            ConvertedTypeToString(node->converted_type()) +
> > >                            "] not '" +
> > >                            ConvertedTypeToString(converted_type) + "'");
> > >   }
> > > 
> > > from StreamWriter::CheckColumn() in src/parquet/stream_writer.cc
> > > 
> > > BR,
> > > //Anders
> > 


[jira] [Created] (ARROW-7745) [Doc] [C++] Update Parquet documentation

2020-02-02 Thread Gawain BOLTON (Jira)
Gawain BOLTON created ARROW-7745:


 Summary: [Doc] [C++] Update Parquet documentation
 Key: ARROW-7745
 URL: https://issues.apache.org/jira/browse/ARROW-7745
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Gawain BOLTON
Assignee: Gawain BOLTON






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [VOTE] Release Apache Arrow 0.16.0 - RC1

2020-01-28 Thread Gawain Bolton

Hello,

It would seem that the list of issues does not include any of the issues 
in the Parquet project which were fixed in this release.


Cheers,

Gawain

On 28/01/2020 11:46, Krisztián Szűcs wrote:

Sorry, the previous email is hardly readable.

I would like to propose the following release candidate (RC1) of Apache
Arrow version 0.16.0. This is a release consisting of 710 resolved JIRA
issues[1].

This release candidate is based on commit:
188afde1f4298fb668e8ebadeacbc545e2de086f [2]

The source release rc1 is hosted at [3].
The binary artifacts are hosted at [4][5][6][7].
The changelog is located at [8].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. See [9] for how to validate a release candidate.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow 0.16.0
[ ] +0
[ ] -1 Do not release this as Apache Arrow 0.16.0 because...

[1]: 
https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20%28Resolved%2C%20Closed%29%20AND%20fixVersion%20%3D%200.16.0
[2]: 
https://github.com/apache/arrow/tree/188afde1f4298fb668e8ebadeacbc545e2de086f
[3]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-0.16.0-rc1
[4]: https://bintray.com/apache/arrow/centos-rc/0.16.0-rc1
[5]: https://bintray.com/apache/arrow/debian-rc/0.16.0-rc1
[6]: https://bintray.com/apache/arrow/python-rc/0.16.0-rc1
[7]: https://bintray.com/apache/arrow/ubuntu-rc/0.16.0-rc1
[8]: 
https://github.com/apache/arrow/blob/188afde1f4298fb668e8ebadeacbc545e2de086f/CHANGELOG.md
[9]: 
https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates



[jira] [Created] (ARROW-7294) [Python] converted_type_name_from_enum(): Incorrect name for INT_64

2019-12-02 Thread Gawain BOLTON (Jira)
Gawain BOLTON created ARROW-7294:


 Summary: [Python] converted_type_name_from_enum(): Incorrect name 
for INT_64
 Key: ARROW-7294
 URL: https://issues.apache.org/jira/browse/ARROW-7294
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Gawain BOLTON
Assignee: Gawain BOLTON


The INT_64 type is converted to "UINT_64".

It should be converted to "INT_64".
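
As an illustration only, the kind of mapping involved and a check that would
catch this mix-up can be sketched as follows. This is not the code of
converted_type_name_from_enum(); the enum ConvertedTypeTag and both functions
are hypothetical names that merely mirror the integer converted types.

```cpp
#include <cassert>
#include <string>

// Hypothetical subset of converted types (signed and unsigned integers).
enum class ConvertedTypeTag {
  INT_8, INT_16, INT_32, INT_64,
  UINT_8, UINT_16, UINT_32, UINT_64
};

// Maps each tag to its name. The reported bug corresponds to the INT_64
// case returning "UINT_64" instead of "INT_64".
inline std::string ConvertedTypeTagName(ConvertedTypeTag tag) {
  switch (tag) {
    case ConvertedTypeTag::INT_8:   return "INT_8";
    case ConvertedTypeTag::INT_16:  return "INT_16";
    case ConvertedTypeTag::INT_32:  return "INT_32";
    case ConvertedTypeTag::INT_64:  return "INT_64";
    case ConvertedTypeTag::UINT_8:  return "UINT_8";
    case ConvertedTypeTag::UINT_16: return "UINT_16";
    case ConvertedTypeTag::UINT_32: return "UINT_32";
    case ConvertedTypeTag::UINT_64: return "UINT_64";
  }
  return "UNKNOWN";
}

// A check of this shape distinguishes the signed and unsigned 64-bit names.
inline void CheckSignedAndUnsignedNames() {
  assert(ConvertedTypeTagName(ConvertedTypeTag::INT_64) == "INT_64");
  assert(ConvertedTypeTagName(ConvertedTypeTag::UINT_64) == "UINT_64");
}
```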





Re: [C++][Parquet]: Stream API handling of optional fields

2019-11-16 Thread Gawain Bolton

Thanks for your reply.

If I understand correctly, ARROW-7178 must be done first so that Arrow has
a version of std::optional which Parquet could then use.


I think I will submit a PR for this shortly.

Gawain

On 15/11/2019 14:05, Francois Saint-Jacques wrote:

I'm all for it. I created [1]; it would also enable an operator[] for
arrays of primitive types [2].

[1] https://issues.apache.org/jira/browse/ARROW-7178
[2] https://issues.apache.org/jira/browse/ARROW-6276

On Fri, Nov 15, 2019 at 12:40 AM Micah Kornfield wrote:

I think there are potentially other places in the Arrow code base where
"optional" could be useful (e.g. a row-reader-like class for Arrow
Tables).  It looks like there is at least one header-only optional library
[1] that is C++17 forward compatible.  I think I would lean towards
vendoring that or another header-only library, instead of depending on
Boost (I would need to double check, but I seem to recall there being a
difference between the Boost one and the standard one).

[1] https://github.com/martinmoene/optional-lite

On Thu, Nov 14, 2019 at 1:22 PM Gawain Bolton wrote:


Hello,

I would like to add support for handling optional fields to the
parquet::StreamReader and parquet::StreamWriter classes which I recently
contributed (thank you!).

Ideally I would do this by using std::optional like this:

  parquet::StreamWriter writer{ parquet::ParquetFileWriter::Open(...) };

  std::optional d;

  writer << d;

  ...

  parquet::StreamReader reader{parquet::ParquetFileReader::Open(...)};

  reader >> d;

However std::optional is only available in C++17 and arrow is compiled
in C++11 mode.

From what I see arrow does use Boost to a limited extent and in fact
gandiva/cache.h uses the boost::optional class.

So would it be possible to use the boost::optional class in parquet?

Or perhaps someone can suggest another way of handling optional fields?

Thanks in advance for your help,

Gawain





[C++][Parquet]: Stream API handling of optional fields

2019-11-14 Thread Gawain Bolton

Hello,

I would like to add support for handling optional fields to the 
parquet::StreamReader and parquet::StreamWriter classes which I recently 
contributed (thank you!).


Ideally I would do this by using std::optional like this:

    parquet::StreamWriter writer{ parquet::ParquetFileWriter::Open(...) };

    std::optional d;

    writer << d;

    ...

    parquet::StreamReader reader{parquet::ParquetFileReader::Open(...)};

    reader >> d;

However std::optional is only available in C++17 and arrow is compiled 
in C++11 mode.


From what I see arrow does use Boost to a limited extent and in fact 
gandiva/cache.h uses the boost::optional class.


So would it be possible to use the boost::optional class in parquet?

Or perhaps someone can suggest another way of handling optional fields?
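
To make the idea concrete, here is a rough C++11-compatible sketch of what
is meant. Everything in it is a hypothetical stand-in: Optional is a
deliberately tiny replacement for std::optional, and MockWriter, MockReader
and Cell are not the parquet classes. It only shows stream operators that
write a null for an empty optional and read it back as empty.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Minimal C++11 stand-in for an optional value. Hypothetical: this is not
// std::optional, boost::optional or an Arrow type.
template <typename T>
class Optional {
 public:
  Optional() : has_value_(false), value_() {}
  Optional(const T& value) : has_value_(true), value_(value) {}  // implicit

  bool has_value() const { return has_value_; }
  const T& value() const {
    if (!has_value_) throw std::logic_error("Optional has no value");
    return value_;
  }

 private:
  bool has_value_;
  T value_;
};

// One stored field: either a null or a double.
struct Cell {
  bool is_null;
  double value;
};

// Stand-in for a stream writer: appends one cell per field.
class MockWriter {
 public:
  MockWriter& operator<<(double v) {
    Cell cell = {false, v};
    cells_.push_back(cell);
    return *this;
  }
  MockWriter& WriteNull() {
    Cell cell = {true, 0.0};
    cells_.push_back(cell);
    return *this;
  }
  const std::vector<Cell>& cells() const { return cells_; }

 private:
  std::vector<Cell> cells_;
};

// An empty optional becomes a null field, otherwise the value is written
// through the existing operator for the underlying type.
template <typename T>
MockWriter& operator<<(MockWriter& writer, const Optional<T>& v) {
  return v.has_value() ? (writer << v.value()) : writer.WriteNull();
}

// Stand-in for a stream reader over the cells produced above.
class MockReader {
 public:
  explicit MockReader(const std::vector<Cell>& cells)
      : cells_(cells), pos_(0) {}

  MockReader& operator>>(Optional<double>& v) {
    if (pos_ >= cells_.size()) throw std::out_of_range("No more fields");
    const Cell& cell = cells_[pos_++];
    v = cell.is_null ? Optional<double>() : Optional<double>(cell.value);
    return *this;
  }

 private:
  std::vector<Cell> cells_;
  std::size_t pos_;
};

// Usage: write an empty and a set optional, then read both back.
inline void RoundTripExample() {
  MockWriter writer;
  Optional<double> d;
  writer << d;  // written as a null field
  d = 1.5;
  writer << d;  // written as a value

  MockReader reader(writer.cells());
  Optional<double> out;
  reader >> out;
  assert(!out.has_value());
  reader >> out;
  assert(out.has_value() && out.value() == 1.5);
}
```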

Thanks in advance for your help,

Gawain




[jira] [Created] (ARROW-6992) [C++]: Undefined Behavior sanitizer build option fails with GCC

2019-10-25 Thread Gawain BOLTON (Jira)
Gawain BOLTON created ARROW-6992:


 Summary: [C++]: Undefined Behavior sanitizer build option fails 
with GCC
 Key: ARROW-6992
 URL: https://issues.apache.org/jira/browse/ARROW-6992
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Gawain BOLTON


When building with the "undefined behaviour sanitizer" option 
(-DARROW_USE_UBSAN=ON), the compilation fails with GCC:
{noformat}
c++: error: unrecognized argument to ‘-fno-sanitize=’ option: 
‘function’{noformat}
It appears that GCC has never had a "-fsanitize=function" option.

I have fixed this issue and will submit a PR. 


