[jira] [Created] (ARROW-217) Fix Travis w.r.t conda 4.1.0 changes

2016-06-15 - Uwe L. Korn (JIRA)
Uwe L. Korn created ARROW-217:
------------------------------

 Summary: Fix Travis w.r.t conda 4.1.0 changes
 Key: ARROW-217
 URL: https://issues.apache.org/jira/browse/ARROW-217
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Uwe L. Korn
Assignee: Uwe L. Korn








[jira] [Assigned] (ARROW-214) C++: Add String support to Parquet I/O

2016-06-15 - Uwe L. Korn (JIRA)

[ https://issues.apache.org/jira/browse/ARROW-214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe L. Korn reassigned ARROW-214:
---------------------------------

Assignee: Uwe L. Korn

> C++: Add String support to Parquet I/O
> --------------------------------------
>
> Key: ARROW-214
> URL: https://issues.apache.org/jira/browse/ARROW-214
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Python
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>
> String support is currently not implemented in the readers. There is also a
> slight difference between the string type definitions in Arrow (strings are
> non-primitive) and Parquet (strings are a primitive type).
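As a rough illustration of the two representations involved (this uses the Python API of much later pyarrow releases, so it is not the code under discussion here): Arrow strings are a variable-length, non-primitive layout of validity/offsets/data buffers, while Parquet stores the same column as the BYTE_ARRAY physical type with a string annotation.

    # Illustration only: pyarrow API from later releases, not the patch for this issue.
    import pyarrow as pa
    import pyarrow.parquet as pq

    arr = pa.array(["foo", "bar", None, "baz"])      # Arrow string array
    # Non-primitive layout: a validity bitmap, int32 offsets, and a data buffer.
    print(arr.type, [b.size if b is not None else None for b in arr.buffers()])

    # Parquet stores the same column as the BYTE_ARRAY physical type (UTF8 annotated).
    table = pa.Table.from_arrays([arr], names=["s"])
    pq.write_table(table, "strings.parquet")
    roundtripped = pq.read_table("strings.parquet")
    assert roundtripped.column("s").to_pylist() == ["foo", "bar", None, "baz"]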





[jira] [Resolved] (ARROW-217) Fix Travis w.r.t conda 4.1.0 changes

2016-06-15 - Wes McKinney (JIRA)

[ https://issues.apache.org/jira/browse/ARROW-217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney resolved ARROW-217.

Resolution: Fixed

Issue resolved by pull request 90
[https://github.com/apache/arrow/pull/90]

> Fix Travis w.r.t conda 4.1.0 changes
> ------------------------------------
>
> Key: ARROW-217
> URL: https://issues.apache.org/jira/browse/ARROW-217
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>






[jira] [Created] (ARROW-218) Add option to use GitHub API token via environment variable when merging PRs

2016-06-15 - Wes McKinney (JIRA)
Wes McKinney created ARROW-218:
-------------------------------

 Summary: Add option to use GitHub API token via environment 
variable when merging PRs
 Key: ARROW-218
 URL: https://issues.apache.org/jira/browse/ARROW-218
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Wes McKinney
Assignee: Wes McKinney


While the patch tool only requires public repo access, on shared networks the
GitHub API rate limit may be exceeded for unauthenticated requests. This patch
will add an option to use a GitHub personal access token to authenticate.
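As a rough sketch of the mechanism (not the actual patch; the environment-variable name and the use of the `requests` library here are assumptions), the merge helper would read a personal access token from the environment and attach it to each GitHub API call:

    # Sketch only: the env-var name and this helper are hypothetical, not the merged change.
    import os
    import requests

    def github_api_get(url):
        """GET a GitHub API URL, authenticating when a token is available."""
        headers = {}
        token = os.environ.get("ARROW_GITHUB_API_TOKEN")  # hypothetical variable name
        if token:
            # Authenticated requests get a much higher rate limit than anonymous ones.
            headers["Authorization"] = "token %s" % token
        response = requests.get(url, headers=headers)
        response.raise_for_status()
        return response.json()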





Re: IO considerations for PyArrow

2016-06-15 - Wes McKinney
Hi folks,

I put some more thought into the "IO problem" as it relates to Arrow in
C++ (and transitively, Python) and wrote a short Google document with
my thoughts on it:

https://docs.google.com/document/d/16y-eyIgSVL8m5Q7Mmh-jIDRwlh-r0bYatYuDl4sbMIk/edit#

Feedback greatly appreciated! This will be on my critical path in the
near future, so I would like to know whether I'm approaching the
problem right and whether we are in alignment (then we can break
things down into a bunch of JIRAs).

(I can also post this doc directly to the mailing list; I thought the
initial discussion would be simpler in a GDoc.)

Thank you
Wes

On Wed, Jun 8, 2016 at 4:11 PM, Wes McKinney  wrote:
> On Fri, Jun 3, 2016 at 10:16 AM, Micah Kornfield  
> wrote:
>> Hi Wes,
>>
>> At what level do you imagine the "opt-in" happening? Right now it
>> seems like it would be fairly straightforward at build time. However,
>> when we start packaging pyarrow for distribution, how do you imagine
>> it will work? (If [1] already answers this, please let me know; I've
>> been meaning to take a look at it.)
>>
>
> Where packaging and distribution are concerned, it'd be easiest to
> provide non-picky users with a kitchen-sink build, but otherwise
> developers could create precisely the build they want with CMake
> flags, I guess. If certain libraries aren't found, then we wouldn't
> fail the build by default, for example.
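One consequence of build-time opt-in (an illustration only, not something decided in this thread) is that a packaged pyarrow may or may not ship a given IO component, so the Python layer could detect what was compiled in and fail with a clear message rather than an import error; the module and flag names below are hypothetical.

    # Hypothetical sketch of surfacing optional components in a packaged pyarrow.
    # The module name (pyarrow.hdfs) and flag are illustrative, not an agreed-on API.
    try:
        import pyarrow.hdfs as hdfs   # present only if the build enabled HDFS support
        HAVE_HDFS = True
    except ImportError:
        hdfs = None
        HAVE_HDFS = False

    def connect_hdfs(host, port):
        if not HAVE_HDFS:
            raise RuntimeError("this pyarrow build was compiled without HDFS support")
        return hdfs.connect(host, port)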
>
>> I need to grok the Python code base a little bit more to understand
>> the implications of the scope creep and the pain around taking a more
>> fine-grained component approach.  But in general my experience has
>> been that packaging things together while maintaining clear internal
>> code boundaries for later separation is a good pragmatic approach.
>>
>
> I'd propose creating an `arrow_io` leaf shared library where we can
> build a small IO subsystem for reuse amongst different data
> connectors. We can leave things fairly coarse-grained for the time
> being and break things up later if it becomes onerous for other Arrow
> developer-users.
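To make the "small IO subsystem" idea concrete, here is a rough sketch (not a proposed API; class and method names are made up) of the kind of narrow read interface an arrow_io library could expose, which local-file, memory-map, HDFS, or blob-store connectors would then implement:

    # Illustrative sketch of a minimal shared IO interface; all names are hypothetical.
    from abc import ABC, abstractmethod
    import os

    class RandomAccessSource(ABC):
        """Narrow read-only interface shared by file, mmap, and blob-store readers."""

        @abstractmethod
        def size(self) -> int:
            """Total number of bytes available."""

        @abstractmethod
        def read_at(self, offset: int, nbytes: int) -> bytes:
            """Read up to nbytes starting at the given byte offset."""

    class LocalFileSource(RandomAccessSource):
        """Local-file implementation; an HDFS or S3 connector would mirror this."""

        def __init__(self, path: str):
            self._f = open(path, "rb")

        def size(self) -> int:
            return os.fstat(self._f.fileno()).st_size

        def read_at(self, offset: int, nbytes: int) -> bytes:
            self._f.seek(offset)
            return self._f.read(nbytes)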
>
>> As a side note, hopefully we'll be able to re-use some existing
>> projects to do the heavy lifting for blob store integration.  SFrame
>> is one option [2] and [3] might be worth investigating as well (both
>> appear to be Apache 2.0 licensed).
>
> While requiring Java + $HADOOP_HOME for HDFS connectivity (wrapper
> around libhdfs) doesn't excite me that much, the prospect of bugs (or
> secure cluster issues) creeping up from a 3rd-party HDFS client
> without the ability to escalate problems to the Apache Hadoop team
> worries me even more. There is a new official C++ HDFS client in the
> works after the libhdfs3 patch was not accepted
> (https://issues.apache.org/jira/browse/HDFS-8707), so this may be
> worth pursuing once it matures.
>
> Thoughts on this welcome.
>
> - Wes
>
>>
>> Thanks,
>> -Micah
>>
>> [1] https://github.com/apache/arrow/pull/79/files
>> [2] https://github.com/apache/incubator-hawq/tree/master/depends/libhdfs3
>> [3] https://github.com/aws/aws-sdk-cpp
>>
>>