[jira] [Created] (ARROW-2502) [Rust] Restore Windows Compatibility

2018-04-23 Thread Paddy Horan (JIRA)
Paddy Horan created ARROW-2502:
--

 Summary: [Rust] Restore Windows Compatibility
 Key: ARROW-2502
 URL: https://issues.apache.org/jira/browse/ARROW-2502
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Reporter: Paddy Horan
Assignee: Paddy Horan


Windows support is currently broken due to a call to free in builder.rs and the 
memory_pool abstraction.





Re: PyCon Sprint Document

2018-04-23 Thread Alex Hagerman

That makes sense :)

https://docs.google.com/document/d/1ENUapCkKccgqSQUooSlQIymeYPFL3H1nkRPy6rs5E5A/edit?usp=sharing 



Thanks,

Alex


On 04/22/2018 10:32 PM, Wes McKinney wrote:

Hi Alex,

Can you create a Google Document so we can comment and collaborate more
easily?

Thanks
Wes

On Sun, Apr 22, 2018, 4:23 PM Alex Hagerman  wrote:


Hi Everybody,

PyCon is next month, May 9th-17th. During PyCon there are a few days of
sprints and I was hoping to spend some time on Arrow. The PyCon sprint page
mentions putting together a sprint document to share with participants.

https://us.pycon.org/2018/community/sprints/#!

http://opensource-events.com/

I've started putting together a document and thought I would share it for
feedback. One of the things that was suggested is identifying a goal to
accomplish during the event. After talking with Uwe and looking through
JIRA I thought focusing on improving NumPy and pandas integration would be
a good goal for PyCon. I plan on tagging tickets as well as getting some
filters prepared between now and PyCon, but would appreciate any other
advice others may have based on experience with Arrow dev or with events
similar to the PyCon sprints. It would also be great to know if anybody
else is going to make it to PyCon.


Thanks,

Alex





[jira] [Created] (ARROW-2501) Upgrade Jackson to 2.9.5

2018-04-23 Thread Andy Grove (JIRA)
Andy Grove created ARROW-2501:
-

 Summary: Upgrade Jackson to 2.9.5
 Key: ARROW-2501
 URL: https://issues.apache.org/jira/browse/ARROW-2501
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java - Memory, Java - Vectors
Affects Versions: 0.9.0
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 0.10.0


I would like to upgrade Jackson to the latest version (2.9.5). If there are no 
objections I will create a PR (it is literally just changing the version number 
in the pom - no code changes required).





Re: Changing Github activity in JIRA

2018-04-23 Thread Uwe L. Korn
Nice, this seems to work now.

Although it confused me at first that my github comment triggered a 10 minute 
work log :D

Uwe

On Mon, Apr 23, 2018, at 8:58 PM, Bryan Cutler wrote:
> They can do it for us, I filed
> https://issues.apache.org/jira/browse/INFRA-16426
> 
> On Mon, Apr 23, 2018 at 11:43 AM, Wes McKinney  wrote:
> 
> > That sounds like an INFRA-level thing, I'm sure they'll tell us if not
> >
> > On Mon, Apr 23, 2018 at 2:40 PM, Bryan Cutler  wrote:
> > > Yeah, I understand that we don't want to rely on a third party to store
> > all
> > > of our code discussions.  I took a look at a Beam issue and it had the
> > > Github discussion under "Work Log", so it does seem possible to do that.
> > > I'll send a question to INFRA about it, but is the configuration for this
> > > controlled by Arrow somewhere?
> > >
> > > On Fri, Apr 20, 2018 at 9:44 AM, Wes McKinney 
> > wrote:
> > >
> > >> hi Bryan,
> > >>
> > >> We definitely need to persist the GitHub activity on JIRA or a mailing
> > >> list somewhere because stuff on GitHub is not permanent and can be
> > >> deleted (e.g. comments or code reviews can be deleted). We should
> > >> inquire if there's a way to separate it from regular comments on JIRA
> > >> to make it easier for discussions on JIRA
> > >>
> > >> As for the e-mails, it's easy enough to filter out the
> > >> automatically-generated ones by ASF GitHub Bot if you don't want to
> > >> see them
> > >>
> > >> - Wes
> > >>
> > >> On Fri, Apr 20, 2018 at 12:39 PM, Antoine Pitrou 
> > >> wrote:
> > >> >
> > >> > Hi,
> > >> >
> > >> > I agree with this.  Not receiving e-mail notifications for those would
> > >> > be nice as well (since I typically already receive e-mail
> > notifications
> > >> > from Github for the same activity).
> > >> >
> > >> > Regards
> > >> >
> > >> > Antoine.
> > >> >
> > >> >
> > >> > Le 20/04/2018 à 18:37, Bryan Cutler a écrit :
> > >> >> Hi All,
> > >> >>
> > >> >> I was just wondering if it was possible to move the Github activity
> > for
> > >> a
> > >> >> PR into a different tab in the JIRA, like "Work Log?"  Or maybe just
> > >> stop
> > >> >> posting it altogether since the PR link is there? It is usually a
> > >> ton of
> > >> >> text and makes it hard to have a discussion in the JIRA or go back
> > and
> > >> try
> > >> >> to look at certain comments.
> > >> >>
> > >> >> Thanks,
> > >> >> Bryan
> > >> >>
> > >>
> >


Re: Changing Github activity in JIRA

2018-04-23 Thread Bryan Cutler
They can do it for us, I filed
https://issues.apache.org/jira/browse/INFRA-16426

On Mon, Apr 23, 2018 at 11:43 AM, Wes McKinney  wrote:

> That sounds like an INFRA-level thing, I'm sure they'll tell us if not
>
> On Mon, Apr 23, 2018 at 2:40 PM, Bryan Cutler  wrote:
> > Yeah, I understand that we don't want to rely on a third party to store
> all
> > of our code discussions.  I took a look at a Beam issue and it had the
> > Github discussion under "Work Log", so it does seem possible to do that.
> > I'll send a question to INFRA about it, but is the configuration for this
> > controlled by Arrow somewhere?
> >
> > On Fri, Apr 20, 2018 at 9:44 AM, Wes McKinney 
> wrote:
> >
> >> hi Bryan,
> >>
> >> We definitely need to persist the GitHub activity on JIRA or a mailing
> >> list somewhere because stuff on GitHub is not permanent and can be
> >> deleted (e.g. comments or code reviews can be deleted). We should
> >> inquire if there's a way to separate it from regular comments on JIRA
> >> to make it easier for discussions on JIRA
> >>
> >> As for the e-mails, it's easy enough to filter out the
> >> automatically-generated ones by ASF GitHub Bot if you don't want to
> >> see them
> >>
> >> - Wes
> >>
> >> On Fri, Apr 20, 2018 at 12:39 PM, Antoine Pitrou 
> >> wrote:
> >> >
> >> > Hi,
> >> >
> >> > I agree with this.  Not receiving e-mail notifications for those would
> >> > be nice as well (since I typically already receive e-mail
> notifications
> >> > from Github for the same activity).
> >> >
> >> > Regards
> >> >
> >> > Antoine.
> >> >
> >> >
> >> > Le 20/04/2018 à 18:37, Bryan Cutler a écrit :
> >> >> Hi All,
> >> >>
> >> >> I was just wondering if it was possible to move the Github activity
> for
> >> a
> >> >> PR into a different tab in the JIRA, like "Work Log?"  Or maybe just
> >> stop
> >> >> posting it altogether since the PR link is there? It is usually a
> >> ton of
> >> >> text and makes it hard to have a discussion in the JIRA or go back
> and
> >> try
> >> >> to look at certain comments.
> >> >>
> >> >> Thanks,
> >> >> Bryan
> >> >>
> >>
>


Re: Changing Github activity in JIRA

2018-04-23 Thread Bryan Cutler
Yeah, I understand that we don't want to rely on a third party to store all
of our code discussions.  I took a look at a Beam issue and it had the
Github discussion under "Work Log", so it does seem possible to do that.
I'll send a question to INFRA about it, but is the configuration for this
controlled by Arrow somewhere?

On Fri, Apr 20, 2018 at 9:44 AM, Wes McKinney  wrote:

> hi Bryan,
>
> We definitely need to persist the GitHub activity on JIRA or a mailing
> list somewhere because stuff on GitHub is not permanent and can be
> deleted (e.g. comments or code reviews can be deleted). We should
> inquire if there's a way to separate it from regular comments on JIRA
> to make it easier for discussions on JIRA
>
> As for the e-mails, it's easy enough to filter out the
> automatically-generated ones by ASF GitHub Bot if you don't want to
> see them
>
> - Wes
>
> On Fri, Apr 20, 2018 at 12:39 PM, Antoine Pitrou 
> wrote:
> >
> > Hi,
> >
> > I agree with this.  Not receiving e-mail notifications for those would
> > be nice as well (since I typically already receive e-mail notifications
> > from Github for the same activity).
> >
> > Regards
> >
> > Antoine.
> >
> >
> > Le 20/04/2018 à 18:37, Bryan Cutler a écrit :
> >> Hi All,
> >>
> >> I was just wondering if it was possible to move the Github activity for
> a
> >> PR into a different tab in the JIRA, like "Work Log?"  Or maybe just
> stop
> >> posting it altogether since the PR link is there? It is usually a
> ton of
> >> text and makes it hard to have a discussion in the JIRA or go back and
> try
> >> to look at certain comments.
> >>
> >> Thanks,
> >> Bryan
> >>
>


Re: [Java] Upgrading Arrow to JDK 1.8

2018-04-23 Thread Bryan Cutler
+1 for Java 8

On Mon, Apr 23, 2018 at 11:18 AM, Dwight Gunning  wrote:

> A JIRA exists for this
>
> https://issues.apache.org/jira/plugins/servlet/mobile#issue/ARROW-2015
>
> We were in agreement in principle on the migration - but when it was
> created there was a lot of focus on the 0.9 release.
>
> No reason not to begin now
>
> Dwight
> Sent from my iPhone
>
> > On Apr 23, 2018, at 2:00 PM, Li Jin  wrote:
> >
> > I don't see a particular reason to maintain JDK 7 compatibility, and am +1
> > to move to Java 8.
> >
> >> On Mon, Apr 23, 2018 at 1:48 PM, Andy Grove  wrote:
> >>
> >> I’m trying to use the parquet-arrow library, which has just been updated
> >> to use Arrow 0.8.0 but unfortunately I am still running into this issue:
> >>
> >> java.lang.ClassNotFoundException: org.apache.arrow.vector.types.pojo.ArrowType$Struct_
> >>
> >> The class in the arrow jar is actually `Struct` not `Struct_`.
> >>
> >> This is due to Arrow using JDK 7 and Parquet using JDK 8.
> >>
> >> I have confirmed this on my forks and have submitted a PR to upgrade
> Arrow
> >> to JDK 1.8 but the CI tests are running against both JDK 7 and 8 and
> >> obviously the JDK 7 tests are failing.
> >>
> >> Is there any reason for maintaining JDK 7 compatibility? JDK 7 end of
> life
> >> was 3 years ago.
> >>
> >> Thanks,
> >>
> >> Andy.
> >>
> >>
>


Re: [Java] Upgrading Arrow to JDK 1.8

2018-04-23 Thread Dwight Gunning
A JIRA exists for this

https://issues.apache.org/jira/plugins/servlet/mobile#issue/ARROW-2015

We were in agreement in principle on the migration - but when it was created 
there was a lot of focus on the 0.9 release.

No reason not to begin now

Dwight
Sent from my iPhone

> On Apr 23, 2018, at 2:00 PM, Li Jin  wrote:
> 
> I don't see a particular reason to maintain JDK 7 compatibility, and am +1
> to move to Java 8.
> 
>> On Mon, Apr 23, 2018 at 1:48 PM, Andy Grove  wrote:
>> 
>> I’m trying to use the parquet-arrow library, which has just been updated
>> to use Arrow 0.8.0 but unfortunately I am still running into this issue:
>> 
>> java.lang.ClassNotFoundException: org.apache.arrow.vector.types.pojo.ArrowType$Struct_
>> 
>> The class in the arrow jar is actually `Struct` not `Struct_`.
>> 
>> This is due to Arrow using JDK 7 and Parquet using JDK 8.
>> 
>> I have confirmed this on my forks and have submitted a PR to upgrade Arrow
>> to JDK 1.8 but the CI tests are running against both JDK 7 and 8 and
>> obviously the JDK 7 tests are failing.
>> 
>> Is there any reason for maintaining JDK 7 compatibility? JDK 7 end of life
>> was 3 years ago.
>> 
>> Thanks,
>> 
>> Andy.
>> 
>> 


[jira] [Created] (ARROW-2499) [C++] Add iterator facility for Python sequences

2018-04-23 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-2499:
-

 Summary: [C++] Add iterator facility for Python sequences
 Key: ARROW-2499
 URL: https://issues.apache.org/jira/browse/ARROW-2499
 Project: Apache Arrow
  Issue Type: Wish
  Components: C++
Reporter: Antoine Pitrou


The idea is to factor out something like the following:
https://github.com/apache/arrow/pull/1935/files#diff-6ea0fcd65b95b76eab9ddfbd7a173725R78

However I'm not sure which idiom or pattern we should choose. [~cpcloud] any 
idea?
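
One possible idiom, as a hedged sketch only (the helper name and shape are
illustrative, not a settled API): a visitor-style wrapper that centralizes
PySequence_GetItem and the reference counting.

{code:cpp}
#include <Python.h>

#include "arrow/status.h"

// Applies visit(item, index) to each element of a Python sequence, owning
// the temporary reference so that callers cannot leak it.
template <typename Visitor>
arrow::Status VisitSequence(PyObject* seq, Visitor&& visit) {
  const Py_ssize_t size = PySequence_Size(seq);
  if (size == -1) {
    return arrow::Status::Invalid("Object is not a sequence");
  }
  for (Py_ssize_t i = 0; i < size; ++i) {
    PyObject* item = PySequence_GetItem(seq, i);  // returns a new reference
    if (item == nullptr) {
      return arrow::Status::Invalid("Could not get sequence item");
    }
    arrow::Status st = visit(item, i);
    Py_DECREF(item);
    if (!st.ok()) {
      return st;
    }
  }
  return arrow::Status::OK();
}
{code}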





Re: [Java] Upgrading Arrow to JDK 1.8

2018-04-23 Thread Li Jin
I don't see a particular reason to maintain JDK 7 compatibility, and am +1
to move to Java 8.

On Mon, Apr 23, 2018 at 1:48 PM, Andy Grove  wrote:

> I’m trying to use the parquet-arrow library, which has just been updated
> to use Arrow 0.8.0 but unfortunately I am still running into this issue:
>
> java.lang.ClassNotFoundException: org.apache.arrow.vector.types.pojo.ArrowType$Struct_
>
> The class in the arrow jar is actually `Struct` not `Struct_`.
>
> This is due to Arrow using JDK 7 and Parquet using JDK 8.
>
> I have confirmed this on my forks and have submitted a PR to upgrade Arrow
> to JDK 1.8 but the CI tests are running against both JDK 7 and 8 and
> obviously the JDK 7 tests are failing.
>
> Is there any reason for maintaining JDK 7 compatibility? JDK 7 end of life
> was 3 years ago.
>
> Thanks,
>
> Andy.
>
>


[Java] Upgrading Arrow to JDK 1.8

2018-04-23 Thread Andy Grove
I’m trying to use the parquet-arrow library, which has just been updated to use 
Arrow 0.8.0 but unfortunately I am still running into this issue:

java.lang.ClassNotFoundException: org.apache.arrow.vector.types.pojo.ArrowType$Struct_

The class in the arrow jar is actually `Struct` not `Struct_`.

This is due to Arrow using JDK 7 and Parquet using JDK 8.

I have confirmed this on my forks and have submitted a PR to upgrade Arrow to 
JDK 1.8 but the CI tests are running against both JDK 7 and 8 and obviously the 
JDK 7 tests are failing.

Is there any reason for maintaining JDK 7 compatibility? JDK 7 end of life was 
3 years ago.

Thanks,

Andy.



[jira] [Created] (ARROW-2498) [Java] Upgrade to JDK 1.8

2018-04-23 Thread Andy Grove (JIRA)
Andy Grove created ARROW-2498:
-

 Summary: [Java] Upgrade to JDK 1.8
 Key: ARROW-2498
 URL: https://issues.apache.org/jira/browse/ARROW-2498
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java - Memory, Java - Vectors
Affects Versions: 0.11.0
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 0.11.0


I'm trying to use the parquet-arrow module from parquet-mr but I'm running into 
this error which I'm pretty sure is because the two projects use different 
major versions of Java:
{code:java}
  Cause: java.lang.ClassNotFoundException: 
org.apache.arrow.vector.types.pojo.ArrowType$Struct_{code}
The struct is actually named `Struct` not `Struct_`.

This issue tracks the work to upgrade to JDK 1.8.





Re: Problem when saving large files via pyarrow hdfs

2018-04-23 Thread Wes McKinney
hi Jan,

The issue is that the hdfsWrite API uses int32_t (aka "tSize") for write sizes:

https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/hdfs-internal.h#L69

So when writing more than INT32_MAX bytes, we must write in chunks. Can you
please open a JIRA with your bug report and this information so that
this can be fixed in a future release?
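
For reference, here is a minimal sketch of the chunked-write workaround
(assuming the libhdfs signature above; WriteInChunks is an illustrative
name, not the actual Arrow fix):

#include <hdfs.h>  // declares: tSize hdfsWrite(hdfsFS, hdfsFile, const void*, tSize)
#include <algorithm>
#include <cstdint>
#include <limits>

// tSize is int32_t, so a single hdfsWrite call caps out at INT32_MAX bytes.
// Loop over the buffer in chunks no larger than that.
bool WriteInChunks(hdfsFS fs, hdfsFile file, const uint8_t* data,
                   int64_t nbytes) {
  constexpr int64_t kMaxChunk = std::numeric_limits<int32_t>::max();
  int64_t offset = 0;
  while (offset < nbytes) {
    const int32_t chunk =
        static_cast<int32_t>(std::min(kMaxChunk, nbytes - offset));
    const int32_t written = hdfsWrite(fs, file, data + offset, chunk);
    if (written < 0) return false;  // surface the HDFS error to the caller
    offset += written;  // hdfsWrite may accept fewer bytes than requested
  }
  return true;
}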

Thanks!
Wes

On Thu, Apr 19, 2018 at 7:14 AM, Jan-Hendrik Zab  wrote:
>
> Hello!
>
> I'm currently trying to use pyarrow's hdfs lib from within hadoop
> streaming, specifically in the reducer with python 3.6 (anaconda). But
> the mentioned problem occurs either way. pyarrow version is 0.9.0.
>
> I'm starting the actual python script via a wrapper sh script that sets
> the LD_LIBRARY_PATH, since I found that setting it from within python was
> not sufficient.
>
> When I'm just testing the reducer by piping in data manually and trying
> to save data (in this case a gensim model) that is roughly 3GB I only
> get the following error message:
>
>
> File "reducer.py", line 104, in 
>   save_model(model)
> File "reducer.py", line 65, in save_model
>   model.save(model_fd, sep_limit=1024 * 1024, pickle_protocol=4)
> File "/opt/anaconda3/lib/python3.6/site-packages/gensim/models/word2vec.py", 
> line 930, in save
>   super(Word2Vec, self).save(*args, **kwargs)
> File 
> "/opt/anaconda3/lib/python3.6/site-packages/gensim/models/base_any2vec.py", 
> line 281, in save
>   super(BaseAny2VecModel, self).save(fname_or_handle, **kwargs)
> File "/opt/anaconda3/lib/python3.6/site-packages/gensim/utils.py", line 688, 
> in save
>   _pickle.dump(self, fname_or_handle, protocol=pickle_protocol)
> File "io.pxi", line 220, in pyarrow.lib.NativeFile.write
> File "error.pxi", line 79, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: HDFS: Write failed
>
> Files with 700MB in size seem to work fine though. Our default block
> size is 128MB.
>
> The code to save the model is the following:
>
> model = word2vec.Word2Vec(size=300, workers=8, iter=1, sg=1)
> # building model here [removed]
> hdfs_client = hdfs.connect(active_master)
> with hdfs_client.open("/user/zab/w2v/%s_test.model" % key, 'wb') as model_fd:
>     model.save(model_fd, sep_limit=1024 * 1024)
>
> I would appreciate any help :-)
>
> Best,
> Jan
>
> --
> Leibniz Universität Hannover
> Institut für Verteilte Systeme
> Appelstrasse 4 - 30167 Hannover
> Phone:  +49 (0)511 762 - 17706
> Tax ID/Steuernummer: DE811245527


Re: Continuous benchmarking setup

2018-04-23 Thread Tom Augspurger
Currently, there are 3 snowflakes :)

- Benchmark setup: https://github.com/TomAugspurger/asv-runner
  + Some setup to bootstrap a clean install with airflow, conda, asv,
supervisor, etc. All the infrastructure around running the benchmarks.
  + Each project adds itself to the list of benchmarks, as in
https://github.com/TomAugspurger/asv-runner/pull/3. Then things are
re-deployed. Deployment requires ansible and an SSH key for the benchmark
machine.
- Benchmark publishing: After running all the benchmarks, the results are
collected and pushed to https://github.com/tomaugspurger/asv-collection
- Benchmark hosting: A cron job on the server hosting pandas docs pulls
https://github.com/tomaugspurger/asv-collection and serves them from the
`/speed` directory.

There are many things that could be improved on here, but I personally
won't have time in the near term. Happy to assist though.

On Mon, Apr 23, 2018 at 10:15 AM, Wes McKinney  wrote:

> hi Tom -- is the publishing workflow for this documented someplace, or
> available in a GitHub repo? We want to make sure we don't accumulate
> any "snowflakes" in the development process.
>
> thanks!
> Wes
>
> On Fri, Apr 13, 2018 at 8:36 AM, Tom Augspurger
>  wrote:
> > They are run daily and published to http://pandas.pydata.org/speed/
> >
> >
> > 
> > From: Antoine Pitrou 
> > Sent: Friday, April 13, 2018 4:28:11 AM
> > To: dev@arrow.apache.org
> > Subject: Re: Continuous benchmarking setup
> >
> >
> > Nice! Are the benchmark results published somewhere?
> >
> >
> >
> > Le 13/04/2018 à 02:50, Tom Augspurger a écrit :
> >> https://github.com/TomAugspurger/asv-runner/ is the setup for the
> >> projects currently running. Adding arrow to
> >> https://github.com/TomAugspurger/asv-runner/blob/master/tests/full.yml
> >> might work. I'll have to redeploy with the update.
> >>
> >> 
> >> From: Wes McKinney 
> >> Sent: Thursday, April 12, 2018 7:24:20 PM
> >> To: dev@arrow.apache.org
> >> Subject: Re: Continuous benchmarking setup
> >>
> >> hi Antoine,
> >>
> >> I have a bare metal machine at home (affectionately known as the
> >> "pandabox") that's available via SSH that we've been using for
> >> continuous benchmarking for other projects. Arrow is welcome to use
> >> it. I can give you access to the machine if you would like. Hopefully,
> >> we can suitably document the process of setting up a continuous benchmarking
> >> machine so that if we need to migrate to a new machine, it is not too
> >> much of a hardship to do so.
> >>
> >> Thanks
> >> Wes
> >>
> >> On Wed, Apr 11, 2018 at 9:40 AM, Antoine Pitrou 
> wrote:
> >>>
> >>> Hello
> >>>
> >>> With the following changes, it seems we might reach the point where
> >>> we're able to run the Python-based benchmark suite across multiple
> >>> commits (at least the ones not anterior to those changes):
> >>> https://github.com/apache/arrow/pull/1775
> >>>
> >>> To make this truly useful, we would need a dedicated host.  Ideally a
> >>> (Linux) OS running on bare metal, with SMT/HyperThreading disabled.
> >>> If running virtualized, the VM should have dedicated physical CPU
> cores.
> >>>
> >>> That machine would run the benchmarks on a regular basis (perhaps once
> >>> per night) and publish the results in static HTML form somewhere.
> >>>
> >>> (note: nice to have in the future might be access to NVidia hardware,
> >>> but right now there are no CUDA benchmarks in the Python benchmarks)
> >>>
> >>> What should be the procedure here?
> >>>
> >>> Regards
> >>>
> >>> Antoine.
> >>
>


Re: Continuous benchmarking setup

2018-04-23 Thread Wes McKinney
hi Tom -- is the publishing workflow for this documented someplace, or
available in a GitHub repo? We want to make sure we don't accumulate
any "snowflakes" in the development process.

thanks!
Wes

On Fri, Apr 13, 2018 at 8:36 AM, Tom Augspurger
 wrote:
> They are run daily and published to http://pandas.pydata.org/speed/
>
>
> 
> From: Antoine Pitrou 
> Sent: Friday, April 13, 2018 4:28:11 AM
> To: dev@arrow.apache.org
> Subject: Re: Continuous benchmarking setup
>
>
> Nice! Are the benchmark results published somewhere?
>
>
>
> Le 13/04/2018 à 02:50, Tom Augspurger a écrit :
>> https://github.com/TomAugspurger/asv-runner/ is the setup for the projects 
>> currently running. Adding arrow to  
>> https://github.com/TomAugspurger/asv-runner/blob/master/tests/full.yml might 
>> work. I'll have to redeploy with the update.
>>
>> 
>> From: Wes McKinney 
>> Sent: Thursday, April 12, 2018 7:24:20 PM
>> To: dev@arrow.apache.org
>> Subject: Re: Continuous benchmarking setup
>>
>> hi Antoine,
>>
>> I have a bare metal machine at home (affectionately known as the
>> "pandabox") that's available via SSH that we've been using for
>> continuous benchmarking for other projects. Arrow is welcome to use
>> it. I can give you access to the machine if you would like. Hopefully,
>> we can suitably document the process of setting up a continuous benchmarking
>> machine so that if we need to migrate to a new machine, it is not too
>> much of a hardship to do so.
>>
>> Thanks
>> Wes
>>
>> On Wed, Apr 11, 2018 at 9:40 AM, Antoine Pitrou  wrote:
>>>
>>> Hello
>>>
>>> With the following changes, it seems we might reach the point where
>>> we're able to run the Python-based benchmark suite across multiple
>>> commits (at least the ones not anterior to those changes):
>>> https://github.com/apache/arrow/pull/1775
>>>
>>> To make this truly useful, we would need a dedicated host.  Ideally a
>>> (Linux) OS running on bare metal, with SMT/HyperThreading disabled.
>>> If running virtualized, the VM should have dedicated physical CPU cores.
>>>
>>> That machine would run the benchmarks on a regular basis (perhaps once
>>> per night) and publish the results in static HTML form somewhere.
>>>
>>> (note: nice to have in the future might be access to NVidia hardware,
>>> but right now there are no CUDA benchmarks in the Python benchmarks)
>>>
>>> What should be the procedure here?
>>>
>>> Regards
>>>
>>> Antoine.
>>


[jira] [Created] (ARROW-2497) Use ASSERT_NO_FATAL_FAILURE in C++ unit tests

2018-04-23 Thread Joshua Storck (JIRA)
Joshua Storck created ARROW-2497:


 Summary: Use ASSERT_NO_FATAL_FAILURE in C++ unit tests
 Key: ARROW-2497
 URL: https://issues.apache.org/jira/browse/ARROW-2497
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Joshua Storck


A number of unit tests have helper functions that use gtest/arrow ASSERT_ 
macros. Those ASSERT_ macros simply return out of the current context and do 
not throw exceptions or abort. Since these helper functions return void, the 
unit test simply continues when the assertions are triggered. This can lead to 
additional failures, such as segfaults because the test is executing code that 
it did not expect to. By adding the gtest ASSERT_NO_FATAL_FAILURE macro to the calls 
of those helper functions in the outermost scope of the unit test, the test 
will correctly terminate.
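
A minimal sketch of the pattern (the helper and test names are illustrative,
not tied to any particular Arrow test):

{code:cpp}
#include <vector>

#include <gtest/gtest.h>

// Helper using ASSERT_: a fatal failure only returns from this function.
void CheckNonEmpty(const std::vector<int>& values) {
  ASSERT_FALSE(values.empty());
}

TEST(HelperAssertions, StopsOnHelperFailure) {
  std::vector<int> values;
  // Without the wrapper, the test body would keep running after the helper
  // failed and call values.front() on an empty vector.
  ASSERT_NO_FATAL_FAILURE(CheckNonEmpty(values));
  EXPECT_EQ(42, values.front());  // only reached if the helper passed
}
{code}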





Re: C++ RecordBatchWriter/ReadRecordBatch clarification

2018-04-23 Thread Wes McKinney
> So, I guess the ReadRecordBatch function is intended to only work if the 
> records were written by RecordBatchFileWriter, right?

The *StreamWriter and *FileWriter classes use identical code paths for
writing the IPC messages; the only difference is the preamble (the
magic number and padding) and the file footer written at the end, see

https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/writer.cc#L919

https://github.com/apache/arrow/blob/master/format/IPC.md#file-format

I'm looking at your code in
https://github.com/Paradigm4/accelerated_io_tools/blob/master/src/PhysicalAioSave.cpp#L1547
 -- it is not going to work because you wrote a stream that includes
the schema as the first message. If you use
arrow::ipc::WriteRecordBatch instead, then things will work fine.
Another option is to use the generic ReadMessage function twice,
skipping the schema message if you don't need it

Hope this helps
Wes

On Sun, Apr 22, 2018 at 2:33 PM, Rares Vernica  wrote:
> Hi Dimitri,
>
> Thanks for that. I was going something like that. My hope was to use
> StreamWriter to write the batch and then use ReadRecordBatch to read it
> since it is more succinct and I know I only have one batch to read.
>
> Here is the actual code I use
> https://github.com/Paradigm4/accelerated_io_tools/blob/master/src/PhysicalAioSave.cpp#L1547
> and above it, commented out, is the code I would like to use.
>
> So, I guess the ReadRecordBatch function is intended to only work if the
> records were written by RecordBatchFileWriter, right?
>
> Cheers,
> Rares
>
>
> On Tue, Apr 17, 2018 at 1:00 AM, Dimitri Vorona 
> wrote:
>
>> Hi Rares,
>>
>> you use a different reader for the RecordBatch streams. See
>> arrow/ipc/ipc-read-write-test.cc:569-596 for the gist. Also, the second
>> argument to arrow::RecordBatch::Make takes the number of rows in the batch,
> >> so you have to set it to 1 in your example.
>>
>> See https://gist.github.com/alendit/c6cdd1adaf7007786392731152d3b6b9
>>
>> Cheers,
>> Dimitri.
>>
>> On Tue, Apr 17, 2018 at 3:52 AM, Rares Vernica  wrote:
>>
>> > Hi,
>> >
>> > I'm writing a batch of records to a stream and I want to read them
>> later. I
>> > notice that if I use the RecordBatchStreamWriter class to write them and
>> > then ReadRecordBatch function to read them, I get a Segmentation Fault.
>> >
>> > On the other hand, if I use the RecordBatchFileWriter class to write
>> them,
>> > the reading works fine.
>> >
>> > So, is the arrow::ipc::ReadRecordBatch function intended to only work if
>> > the records were written by RecordBatchFileWriter?
>> >
>> > Below is a complete example, showing the two cases. I tried this on
>> Ubuntu
>> > Trusty with Arrow 0.9.0-1
>> >
>> > Thanks!
>> > Rares
>> >
>> >
>> >
>> >
>> >
> >> > // g++-4.9 -ggdb -std=c++11 foo.cpp -larrow
> >> > // NB: the archive stripped everything between angle brackets; the
> >> > // includes and template parameters below are restored to match.
> >> >
> >> > #include <memory>
> >> > #include <vector>
> >> >
> >> > #include <arrow/api.h>
> >> > #include <arrow/io/api.h>
> >> > #include <arrow/ipc/api.h>
> >> >
> >> > int main()
> >> > {
> >> > arrow::MemoryPool* pool = arrow::default_memory_pool();
> >> >
> >> > std::shared_ptr<arrow::PoolBuffer> buffer(new
> >> > arrow::PoolBuffer(pool));
> >> >
> >> > arrow::Int64Builder builder(pool);
> >> > builder.Append(1);
> >> >
> >> > std::shared_ptr<arrow::Array> array;
> >> > builder.Finish(&array);
> >> >
> >> > std::vector<std::shared_ptr<arrow::Field>> schema_vector =
> >> > {arrow::field("id", arrow::int64())};
> >> >
> >> > auto schema = std::make_shared<arrow::Schema>(schema_vector);
> >> >
> >> >
> >> > // Write
> >> > std::shared_ptr<arrow::RecordBatch> batchOut;
> >> > batchOut = arrow::RecordBatch::Make(schema, 10, {array});
> >> >
> >> > std::unique_ptr<arrow::io::BufferOutputStream> stream;
> >> > stream.reset(new arrow::io::BufferOutputStream(buffer));
> >> >
> >> > std::shared_ptr<arrow::ipc::RecordBatchWriter> writer;
> >> >
> >> > // #1 - Segmentation fault (core dumped)
> >> > arrow::ipc::RecordBatchStreamWriter::Open(stream.get(), schema,
> >> > &writer);
> >> >
> >> > // #2 - OK
> >> > //arrow::ipc::RecordBatchFileWriter::Open(stream.get(), schema,
> >> > //&writer);
> >> >
> >> > writer->WriteRecordBatch(*batchOut);
> >> >
> >> > writer->Close();
> >> > stream->Close();
> >> >
> >> >
> >> > // Read
> >> > arrow::io::BufferReader reader(buffer);
> >> > std::shared_ptr<arrow::RecordBatch> batchIn;
> >> > arrow::ipc::ReadRecordBatch(schema, &reader, &batchIn);
> >> > }
>> >
>>


[jira] [Created] (ARROW-2496) [C++] Add support for Libhdfs++

2018-04-23 Thread Deepak Majeti (JIRA)
Deepak Majeti created ARROW-2496:


 Summary: [C++] Add support for Libhdfs++
 Key: ARROW-2496
 URL: https://issues.apache.org/jira/browse/ARROW-2496
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Deepak Majeti
Assignee: Deepak Majeti


Libhdfs++ is an asynchronous pure C++ HDFS client. It is now part of the HDFS 
project. Details are available here:

https://issues.apache.org/jira/browse/HDFS-8707





[jira] [Created] (ARROW-2495) [Plasma] Pretty print plasma objects

2018-04-23 Thread Krisztian Szucs (JIRA)
Krisztian Szucs created ARROW-2495:
--

 Summary: [Plasma] Pretty print plasma objects
 Key: ARROW-2495
 URL: https://issues.apache.org/jira/browse/ARROW-2495
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Krisztian Szucs


The implementation should be based on 
[arrow::pretty_print|https://github.com/apache/arrow/blob/master/cpp/src/arrow/pretty_print.h].
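
A hedged sketch of what the entry point could look like (the function name is
illustrative; PrettyPrint(array, indent, sink) is the existing arrow API):

{code:cpp}
#include <iostream>
#include <memory>

#include "arrow/array.h"
#include "arrow/pretty_print.h"
#include "arrow/status.h"

// Renders the Arrow array reconstructed from a sealed plasma object in a
// human-readable form on stdout.
arrow::Status PrintPlasmaObject(const std::shared_ptr<arrow::Array>& array) {
  return arrow::PrettyPrint(*array, /*indent=*/0, &std::cout);
}
{code}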





[jira] [Created] (ARROW-2494) Return status codes from PlasmaClient::Seal

2018-04-23 Thread Krisztian Szucs (JIRA)
Krisztian Szucs created ARROW-2494:
--

 Summary: Return status codes from PlasmaClient::Seal
 Key: ARROW-2494
 URL: https://issues.apache.org/jira/browse/ARROW-2494
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: C++
Reporter: Krisztian Szucs





