[jira] [Commented] (ARROW-7476) [Python] Arrow error: IOError: Error reading bytes from file: No error

2020-01-07 Thread gaurav vashisth (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010445#comment-17010445
 ] 

gaurav vashisth commented on ARROW-7476:


UPDATE: This error can occur in a file having 1 million records as 
well. 

> [Python] Arrow error: IOError: Error reading bytes from file: No error
> --
>
> Key: ARROW-7476
> URL: https://issues.apache.org/jira/browse/ARROW-7476
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
> Environment: windows
>Reporter: gaurav vashisth
>Priority: Major
>
> When I try to read a parquet file using either Pandas or Dask, I get the 
> following error:
> Arrow error: IOError: Error reading bytes from file: No error. However, when 
> I try again to read the file, sometimes I'm able to read it. Below are 
> the commands I used to read the parquet file.
> With dask:
> dd.read_parquet('my.parquet', engine='pyarrow', compression='snappy').compute()
> With pandas:
> pd.read_parquet('my.parquet', 
> engine='pyarrow', compression='snappy')
>  
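Since the failure is intermittent and a second attempt often succeeds, one pragmatic workaround is to retry the read. A minimal sketch in plain Python; the `flaky_read` function below is a made-up stand-in for the `pd.read_parquet` call, not something from the report:

```python
import time

def read_with_retry(read_fn, attempts=3, delay_s=1.0):
    """Call read_fn(), retrying on OSError up to `attempts` times."""
    last_err = None
    for i in range(attempts):
        try:
            return read_fn()
        except OSError as err:  # Python 3's IOError is an alias of OSError
            last_err = err
            time.sleep(delay_s * (i + 1))  # simple linear backoff
    raise last_err

# Demo with a stand-in reader that fails twice, then succeeds.
calls = {"n": 0}
def flaky_read():
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("Error reading bytes from file: No error")
    return "dataframe"

result = read_with_retry(flaky_read, attempts=5, delay_s=0.0)
print(result)  # dataframe
```

In practice `read_fn` would be a lambda wrapping the pandas or dask call shown above.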



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7514) [C#] Make GetValueOffset Obsolete

2020-01-07 Thread Takashi Hashida (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takashi Hashida updated ARROW-7514:
---
Description: 
[BinaryArray.GetValueOffset|https://github.com/apache/arrow/blob/master/csharp/src/Apache.Arrow/Arrays/BinaryArray.cs#L172]
 and 
[ListArray.GetValueOffset|https://github.com/apache/arrow/blob/master/csharp/src/Apache.Arrow/Arrays/ListArray.cs#L47]
 no longer have value.

We should add an `Obsolete` attribute to these methods in the next release, 
then remove these methods in a future release.

 

See this discussion: 
[https://github.com/apache/arrow/pull/6029#discussion_r361505788]

  was:
[BinaryArray.GetValueOffset|https://github.com/apache/arrow/blob/master/csharp/src/Apache.Arrow/Arrays/BinaryArray.cs#L172]
 and 
[ListArray.GetValueOffset|https://github.com/apache/arrow/blob/master/csharp/src/Apache.Arrow/Arrays/ListArray.cs#L47]
 no longer have value.



We should add an `Obsolete` attribute to these methods in the next release, 
then remove these methods in the future release.

 

Show this discussion: 
[https://github.com/apache/arrow/pull/6029#discussion_r361505788]


> [C#] Make GetValueOffset Obsolete
> -
>
> Key: ARROW-7514
> URL: https://issues.apache.org/jira/browse/ARROW-7514
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C#
>Reporter: Takashi Hashida
>Priority: Major
>
> [BinaryArray.GetValueOffset|https://github.com/apache/arrow/blob/master/csharp/src/Apache.Arrow/Arrays/BinaryArray.cs#L172]
>  and 
> [ListArray.GetValueOffset|https://github.com/apache/arrow/blob/master/csharp/src/Apache.Arrow/Arrays/ListArray.cs#L47]
>  no longer have value.
> We should add an `Obsolete` attribute to these methods in the next release, 
> then remove these methods in a future release.
>  
> See this discussion: 
> [https://github.com/apache/arrow/pull/6029#discussion_r361505788]





[jira] [Assigned] (ARROW-7045) [R] Factor type not preserved in Parquet roundtrip

2020-01-07 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-7045:
--

Assignee: Hiroaki Yutani

> [R] Factor type not preserved in Parquet roundtrip
> --
>
> Key: ARROW-7045
> URL: https://issues.apache.org/jira/browse/ARROW-7045
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Neal Richardson
>Assignee: Hiroaki Yutani
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.16.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> {code:r}
> test_that("Factors are preserved when writing/reading from Parquet", {
>   tf <- tempfile()
>   on.exit(unlink(tf))
>   df <- data.frame(a = factor(c("a", "b")))
>   write_parquet(df, tf)
>   expect_equivalent(read_parquet(tf), df)
> })
> {code}
> Fails:
> {code}
> `object` not equivalent to `expected`.
> Component “a”: target is character, current is factor
> {code}
> This has to do with the translation with Parquet and not the R <--> Arrow 
> type mapping (unlike ARROW-7028). If you write_feather and read_feather, the 
> test passes.





[jira] [Resolved] (ARROW-7045) [R] Factor type not preserved in Parquet roundtrip

2020-01-07 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-7045.

Fix Version/s: 0.16.0
   Resolution: Fixed

Issue resolved by pull request 6135
[https://github.com/apache/arrow/pull/6135]

> [R] Factor type not preserved in Parquet roundtrip
> --
>
> Key: ARROW-7045
> URL: https://issues.apache.org/jira/browse/ARROW-7045
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.16.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> {code:r}
> test_that("Factors are preserved when writing/reading from Parquet", {
>   tf <- tempfile()
>   on.exit(unlink(tf))
>   df <- data.frame(a = factor(c("a", "b")))
>   write_parquet(df, tf)
>   expect_equivalent(read_parquet(tf), df)
> })
> {code}
> Fails:
> {code}
> `object` not equivalent to `expected`.
> Component “a”: target is character, current is factor
> {code}
> This has to do with the translation with Parquet and not the R <--> Arrow 
> type mapping (unlike ARROW-7028). If you write_feather and read_feather, the 
> test passes.





[jira] [Created] (ARROW-7514) [C#] Make GetValueOffset Obsolete

2020-01-07 Thread Takashi Hashida (Jira)
Takashi Hashida created ARROW-7514:
--

 Summary: [C#] Make GetValueOffset Obsolete
 Key: ARROW-7514
 URL: https://issues.apache.org/jira/browse/ARROW-7514
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C#
Reporter: Takashi Hashida


[BinaryArray.GetValueOffset|https://github.com/apache/arrow/blob/master/csharp/src/Apache.Arrow/Arrays/BinaryArray.cs#L172]
 and 
[ListArray.GetValueOffset|https://github.com/apache/arrow/blob/master/csharp/src/Apache.Arrow/Arrays/ListArray.cs#L47]
 no longer have value.



We should add an `Obsolete` attribute to these methods in the next release, 
then remove these methods in a future release.

 

See this discussion: 
[https://github.com/apache/arrow/pull/6029#discussion_r361505788]





[jira] [Created] (ARROW-7513) [JS] Arrow Tutorial: Common data types

2020-01-07 Thread Leo Meyerovich (Jira)
Leo Meyerovich created ARROW-7513:
-

 Summary: [JS] Arrow Tutorial: Common data types
 Key: ARROW-7513
 URL: https://issues.apache.org/jira/browse/ARROW-7513
 Project: Apache Arrow
  Issue Type: Task
  Components: JavaScript
Reporter: Leo Meyerovich
Assignee: Leo Meyerovich


The JS client lacks basic introductory material around creating the common 
data types, such as turning JS arrays into ints, dicts, etc. There is no 
equivalent of Python's [https://arrow.apache.org/docs/python/data.html]. This 
has made usage difficult for me, and I bet for others.

 

As with previous tutorials, I started sketching on 
[https://observablehq.com/@lmeyerov/rich-data-types-in-apache-arrow-js-efficient-data-tables-wit]
 . When we're happy, it can make sense to export it as HTML or something to the 
repo, or just link from the main readme.

I believe the target topics worth covering are:
 * Common user data types: Ints, Dicts, Struct, Time
 * Common column types: Data, Vector, Column
 * Going from individual & arrays & buffers of JS values to Arrow-wrapped 
forms, and basic inspection of the result

Not worth going into here is Tables vs. RecordBatches, which is the other 
tutorial.

 

1. Ideas of what to add/edit/remove?

2. And anyone up for helping with discussion of Data vs. Vector, and ingest of 
Time & Struct?

3. ... Should we be encouraging Struct or Map? I saw some PRs changing stuff 
here.

 

cc [~wesm] [~bhulette] [~paul.e.taylor]

 

 

 





[jira] [Commented] (ARROW-7511) [C#] - Batch / Data Size Can't Exceed 2 gigs

2020-01-07 Thread Anthony Abate (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010293#comment-17010293
 ] 

Anthony Abate commented on ARROW-7511:
--

Now I remember why I thought Memory and Span can't support more than 2 gigs:

The *.Slice()* function only takes int32

https://docs.microsoft.com/en-us/dotnet/api/system.memory-1.slice?view=netcore-3.1#System_Memory_1_Slice_System_Int32_System_Int32_
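Int32 offsets cap a single contiguous slice at 2^31 - 1 bytes. The "move the limit lower" idea from the issue amounts to splitting a large buffer into slices that each fit under that cap; a Python sketch for illustration, with a made-up helper name:

```python
INT32_MAX = 2**31 - 1  # the cap implied by Slice(int, int)

def chunk_ranges(total_bytes, chunk_limit=INT32_MAX):
    """Split a large buffer into (offset, length) pairs, each under the limit."""
    ranges = []
    offset = 0
    while offset < total_bytes:
        length = min(chunk_limit, total_bytes - offset)
        ranges.append((offset, length))
        offset += length
    return ranges

# A 5 GiB buffer needs three slices under the ~2 GiB cap.
five_gib = 5 * 2**30
parts = chunk_ranges(five_gib)
print(len(parts))  # 3
```

Each (offset, length) pair stays representable with int32 lengths, at the cost of code that must iterate over the pieces.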

 

> [C#] - Batch / Data Size Can't Exceed 2 gigs
> 
>
> Key: ARROW-7511
> URL: https://issues.apache.org/jira/browse/ARROW-7511
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C#
>Affects Versions: 0.15.1
>Reporter: Anthony Abate
>Priority: Major
>
> While the Arrow spec does not forbid batches larger than 2 gigs, the C# 
> library can not support this in its current form due to limits on managed 
> memory, as it tries to put the whole batch into a single 
> Span/Memory
> It is possible to fix this by not trying to use Memory/Span/byte[] for the 
> entire batch, and instead move the memory mapping to the ArrowBuffers.  This 
> only moves the problem 'lower', as it would then still limit a column's 
> data in a single batch to 2 gigs.  
> This seems like plenty of memory... but if you think of string columns, the 
> data is just one giant string appended together with offsets, and it can 
> get very large quickly.
> I think the unfortunate problem is that memory management in the C# managed 
> world is always going to hit the 2 gig limit somewhere. (Please correct me if 
> I am wrong on this statement, but I thought I read somewhere that Memory 
> / Span are limited to int and changing to long would require major 
> framework rewrites - but I may be conflating that with array.)
> That ultimately means the C# library either has to reject files with certain 
> characteristics (i.e. validation checks on opening), or the spec needs to put 
> upper limits on certain internal arrow constructs (i.e. arrow buffer) to 
> eliminate the need for more than 2 gigs of contiguous memory for the 
> smallest arrow object.
> However, if the spec was indeed designed for the smallest buffer object to be 
> larger than 2 gigs, or for the entire memory buffer of arrow to be 
> contiguous, one has to wonder if at some point it might just make sense for 
> the C# library to use the C++ library as its memory manager, as replicating 
> very large blocks of memory is more work than it's worth.
> In any case, this issue is more about 'deferring' the 2 gig size problem by 
> moving it down to the buffer objects... This might require some rewrite of 
> the batch data structures.
>  
>  





[jira] [Updated] (ARROW-7501) [C++] CMake build_thrift should build flex and bison if necessary

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-7501:

Fix Version/s: 0.16.0

> [C++] CMake build_thrift should build flex and bison if necessary
> -
>
> Key: ARROW-7501
> URL: https://issues.apache.org/jira/browse/ARROW-7501
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Neal Richardson
>Priority: Major
> Fix For: 0.16.0
>
>
> On MSVC and APPLE, {{build_thrift}} will handle thrift's flex and bison 
> dependencies: 
> [https://github.com/apache/arrow/blob/f578521/cpp/cmake_modules/ThirdpartyToolchain.cmake#L1052-L1097]
> But you're on your own on Linux. In ARROW-6793, I wrote 100 lines of R code 
> to do this for my needs: 
> [https://github.com/apache/arrow/pull/6068/files#diff-3875fa5e75833c426b36487b25892bd8R204-R309]
> We should translate this to CMake so it's generally available.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7501) [C++] CMake build_thrift should build flex and bison if necessary

2020-01-07 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010250#comment-17010250
 ] 

Wes McKinney commented on ARROW-7501:
-

On reviewing what Neal's patch does, I think I see the conflict: we want 
to enable a seamless install on Linux systems that do not have these packages 
installed. It might be better to address this with ARROW-6821; the question is 
what kind of forward/backward compatibility there is in generated Thrift sources.

> [C++] CMake build_thrift should build flex and bison if necessary
> -
>
> Key: ARROW-7501
> URL: https://issues.apache.org/jira/browse/ARROW-7501
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Neal Richardson
>Priority: Major
>
> On MSVC and APPLE, {{build_thrift}} will handle thrift's flex and bison 
> dependencies: 
> [https://github.com/apache/arrow/blob/f578521/cpp/cmake_modules/ThirdpartyToolchain.cmake#L1052-L1097]
> But you're on your own on Linux. In ARROW-6793, I wrote 100 lines of R code 
> to do this for my needs: 
> [https://github.com/apache/arrow/pull/6068/files#diff-3875fa5e75833c426b36487b25892bd8R204-R309]
> We should translate this to CMake so it's generally available.





[jira] [Updated] (ARROW-7475) [Rust] Create Arrow Stream writer

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-7475:

Fix Version/s: (was: 0.16.0)
   1.0.0

> [Rust] Create Arrow Stream writer
> -
>
> Key: ARROW-7475
> URL: https://issues.apache.org/jira/browse/ARROW-7475
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Reporter: Neville Dipale
>Priority: Major
> Fix For: 1.0.0
>
>






[jira] [Commented] (ARROW-7376) [C++] parquet NaN/null double statistics can result in endless loop

2020-01-07 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010211#comment-17010211
 ] 

Wes McKinney commented on ARROW-7376:
-

This should ideally be fixed for the next major release.

> [C++] parquet NaN/null double statistics can result in endless loop
> ---
>
> Key: ARROW-7376
> URL: https://issues.apache.org/jira/browse/ARROW-7376
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.15.1
>Reporter: Pierre Belzile
>Priority: Major
>  Labels: parquet
> Fix For: 0.16.0
>
>
> There is a bug in the doubles column statistics computation when writing to 
> parquet an array with only NaNs and nulls. It loops endlessly if the last 
> cell of a write group is a Null. The line in error is 
> [https://github.com/apache/arrow/blob/master/cpp/src/parquet/statistics.cc#L633]
>  which checks for NaN but not for Null. The code then falls through, loops 
> endlessly, and causes the program to appear frozen.
> This code snippet reproduces the problem:
> {noformat}
> TEST(parquet, nans) {
>   /* Create a small parquet structure */
>   std::vector<std::shared_ptr<::arrow::Field>> fields;
>   fields.push_back(::arrow::field("doubles", ::arrow::float64()));
>   std::shared_ptr<::arrow::Schema> schema = ::arrow::schema(std::move(fields));
>   std::unique_ptr<::arrow::RecordBatchBuilder> builder;
>   ::arrow::RecordBatchBuilder::Make(schema, ::arrow::default_memory_pool(),
>       &builder);
>   builder->GetFieldAs<::arrow::DoubleBuilder>(0)->Append(
>       std::numeric_limits<double>::quiet_NaN());
>   builder->GetFieldAs<::arrow::DoubleBuilder>(0)->AppendNull();
>   std::shared_ptr<::arrow::RecordBatch> batch;
>   builder->Flush(&batch);
>   arrow::PrettyPrint(*batch, 0, &std::cout);
>   std::shared_ptr<arrow::Table> table;
>   arrow::Table::FromRecordBatches({batch}, &table);  /* Attempt to write */
>   std::shared_ptr<::arrow::io::FileOutputStream> os;
>   arrow::io::FileOutputStream::Open("/tmp/test.parquet", &os);
>   parquet::WriterProperties::Builder writer_props_bld;
>   // writer_props_bld.disable_statistics("doubles");
>   std::shared_ptr<parquet::WriterProperties> writer_props =
>       writer_props_bld.build();
>   std::shared_ptr<parquet::ArrowWriterProperties> arrow_props =
>       parquet::ArrowWriterProperties::Builder().store_schema()->build();
>   std::unique_ptr<parquet::arrow::FileWriter> writer;
>   parquet::arrow::FileWriter::Open(
>       *table->schema(), arrow::default_memory_pool(), os,
>       writer_props, arrow_props, &writer);
>   writer->WriteTable(*table, 1024);
> }{noformat}
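For comparison, the intended statistics behaviour is to skip both NaN and null while scanning for comparable values, and to emit no min/max when nothing comparable remains. A language-neutral sketch in Python (illustrative only, not the actual C++ scan in statistics.cc):

```python
import math

def min_max_stats(values):
    """Compute (min, max) over values, skipping None (null) and NaN.

    Returns None when no comparable value exists -- the all-NaN/null case
    that the buggy scan failed to terminate on.
    """
    comparable = [v for v in values
                  if v is not None and not math.isnan(v)]
    if not comparable:
        return None  # all-NaN/null column: no statistics to write
    return (min(comparable), max(comparable))

print(min_max_stats([float("nan"), None]))            # None
print(min_max_stats([1.0, float("nan"), None, 3.0]))  # (1.0, 3.0)
```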





[jira] [Updated] (ARROW-7384) [Website] Fix search indexing warning reported by Google

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-7384:

Fix Version/s: (was: 0.16.0)

> [Website] Fix search indexing warning reported by Google
> 
>
> Key: ARROW-7384
> URL: https://issues.apache.org/jira/browse/ARROW-7384
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Website
>Reporter: Wes McKinney
>Priority: Major
>
> I received the following e-mail from Google regarding arrow.apache.org (since 
> I'm an admin on the Analytics account)
> {code}
> Top Warnings
> Warnings are suggestions for improvement. Some warnings can affect your 
> appearance on Search; some might be reclassified as errors in the future. The 
> following warnings were found on your site:
> Indexed, though blocked by robots.txt
> We recommend that you fix these issues when possible to enable the best 
> experience and coverage in Google Search.
> {code}





[jira] [Updated] (ARROW-7313) [C++] Add function for retrieving a scalar from an array slot

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-7313:

Fix Version/s: (was: 0.16.0)

> [C++] Add function for retrieving a scalar from an array slot
> -
>
> Key: ARROW-7313
> URL: https://issues.apache.org/jira/browse/ARROW-7313
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.15.1
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
>
> It'd be useful to construct scalar values given an array and an index.
> {code}
> /* static */ std::shared_ptr<Scalar> Scalar::FromArray(const Array&, int64_t);
> {code}
> Since this is much less efficient than unboxing the entire array and 
> accessing its buffers directly, it should not be used in hot loops.
> [~kszucs] [~fsaintjacques]





[jira] [Updated] (ARROW-7285) [C++] ensure C++ implementation meets clarified dictionary spec

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-7285:

Fix Version/s: (was: 0.16.0)
   1.0.0

> [C++] ensure C++ implementation meets clarified dictionary spec
> ---
>
> Key: ARROW-7285
> URL: https://issues.apache.org/jira/browse/ARROW-7285
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Micah Kornfield
>Priority: Major
> Fix For: 1.0.0
>
>
> see parent issue.
>  
> CC [~tianchen92]





[jira] [Updated] (ARROW-7283) Ensure dictionary IPC implementations match spec clarifications

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-7283:

Fix Version/s: (was: 0.16.0)
   1.0.0

> Ensure dictionary IPC implementations match spec clarifications
> ---
>
> Key: ARROW-7283
> URL: https://issues.apache.org/jira/browse/ARROW-7283
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Integration
>Reporter: Micah Kornfield
>Priority: Blocker
> Fix For: 1.0.0
>
>
> Parent tracking issue to ensure the clarifications in PR 
> [https://github.com/apache/arrow/pull/5585#pullrequestreview-324979419] are 
> correctly implemented.
>  
> Specifically:
> 1.  dictionary replacement in streams.
> 2.  Not requiring dictionaries be present at the beginning of the stream for 
> all null columns.
> 3.  Dictionary replacement isn't supported in the file format.
>  
> Some implementations might already have some or all of these.  This specific 
> Jira covers adding integration tests (children tasks cover language specific 
> implementations).





[jira] [Updated] (ARROW-7284) [Java] ensure java implementation meets clarified dictionary spec

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-7284:

Fix Version/s: (was: 0.16.0)
   1.0.0

> [Java] ensure java implementation meets clarified dictionary spec
> -
>
> Key: ARROW-7284
> URL: https://issues.apache.org/jira/browse/ARROW-7284
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Java
>Reporter: Micah Kornfield
>Assignee: Ji Liu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> see parent issue.
>  
> CC [~tianchen92]





[jira] [Commented] (ARROW-7269) [C++] Fix arrow::parquet compiler warning

2020-01-07 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010209#comment-17010209
 ] 

Wes McKinney commented on ARROW-7269:
-

We can fix this once the next parquet-format release comes out

> [C++] Fix arrow::parquet compiler warning
> -
>
> Key: ARROW-7269
> URL: https://issues.apache.org/jira/browse/ARROW-7269
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Jiajia Li
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Encountered the compiler warning when building:
> [WARNING:/arrow/cpp/src/parquet/parquet.thrift:297] The "byte" type is a 
> compatibility alias for "i8". Use "i8" to emphasize the signedness of this 
> type.





[jira] [Updated] (ARROW-7269) [C++] Fix arrow::parquet compiler warning

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-7269:

Fix Version/s: (was: 0.16.0)

> [C++] Fix arrow::parquet compiler warning
> -
>
> Key: ARROW-7269
> URL: https://issues.apache.org/jira/browse/ARROW-7269
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Jiajia Li
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Encountered the compiler warning when building:
> [WARNING:/arrow/cpp/src/parquet/parquet.thrift:297] The "byte" type is a 
> compatibility alias for "i8". Use "i8" to emphasize the signedness of this 
> type.





[jira] [Updated] (ARROW-7265) [Format][C++] Clarify the usage of typeIds in Union type documentation

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-7265:

Fix Version/s: (was: 0.16.0)

> [Format][C++] Clarify the usage of typeIds in Union type documentation
> --
>
> Key: ARROW-7265
> URL: https://issues.apache.org/jira/browse/ARROW-7265
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Format
>Reporter: Francois Saint-Jacques
>Priority: Major
>
> The documentation is unclear.





[jira] [Updated] (ARROW-7221) [C++][Documentation] Document how to set installed location for individual toolchain components

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-7221:

Fix Version/s: (was: 0.16.0)

> [C++][Documentation] Document how to set installed location for individual 
> toolchain components
> ---
>
> Key: ARROW-7221
> URL: https://issues.apache.org/jira/browse/ARROW-7221
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> This is not well documented in 
> http://arrow.apache.org/docs/developers/cpp.html#build-dependency-management
> the CMake variables are {{$DEPENDENCY_NAME_ROOT}}





[jira] [Updated] (ARROW-7191) [CI][Crossbow] Nightly build email should distinguish between new failures and still failing builds

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-7191:

Fix Version/s: (was: 0.16.0)

> [CI][Crossbow] Nightly build email should distinguish between new failures 
> and still failing builds
> ---
>
> Key: ARROW-7191
> URL: https://issues.apache.org/jira/browse/ARROW-7191
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Developer Tools
>Reporter: Neal Richardson
>Priority: Major
>
> It would help with triaging the nightly build status if it were more readily 
> visible which builds broke today vs. are still broken (and were triaged 
> yesterday).





[jira] [Updated] (ARROW-7182) [CI][Crossbow] Nightly fuzzit build broken in docker-compose refactor

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-7182:

Fix Version/s: (was: 0.16.0)

> [CI][Crossbow] Nightly fuzzit build broken in docker-compose refactor
> -
>
> Key: ARROW-7182
> URL: https://issues.apache.org/jira/browse/ARROW-7182
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration
>Reporter: Neal Richardson
>Priority: Major
>
> See https://circleci.com/gh/ursa-labs/crossbow/4959 for example. 
> {code}
> /arrow/ci/scripts/fuzzit_build.sh: line 26: pushd: 
> /arrow/cpp/build/relwithdebinfo: No such file or directory
> {code}
> Scrolling up in the logs, it looks like the build dir is actually 
> {{/build/cpp}}, so given that, we should {{pushd /build/cpp/relwithdebinfo}}.
> cc [~kszucs]





[jira] [Updated] (ARROW-7184) [C++][Dataset] Nightly ubuntu 14.04 fails because of dataset filter tests

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-7184:

Fix Version/s: (was: 0.16.0)

> [C++][Dataset] Nightly ubuntu 14.04 fails because of dataset filter tests
> -
>
> Key: ARROW-7184
> URL: https://issues.apache.org/jira/browse/ARROW-7184
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Dataset, Continuous Integration
>Reporter: Neal Richardson
>Assignee: Ben Kietzman
>Priority: Major
>
> See https://circleci.com/gh/ursa-labs/crossbow/4958





[jira] [Updated] (ARROW-7204) [C++][Dataset] In expression should not require exact type match

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-7204:

Fix Version/s: (was: 0.16.0)

> [C++][Dataset] In expression should not require exact type match
> 
>
> Key: ARROW-7204
> URL: https://issues.apache.org/jira/browse/ARROW-7204
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Dataset
>Reporter: Neal Richardson
>Assignee: Ben Kietzman
>Priority: Major
>
> Similar to ARROW-7047. I encountered this on ARROW-7185 
> (https://github.com/apache/arrow/pull/5858/files#diff-1d8a97ca966e8446ef2ae4b7b5a96ed1R125)





[jira] [Updated] (ARROW-7130) [C++][CMake] Automatically set ARROW_GANDIVA_PC_CXX_FLAGS for conda and OSX sdk

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-7130:

Fix Version/s: (was: 0.16.0)

> [C++][CMake] Automatically set ARROW_GANDIVA_PC_CXX_FLAGS for conda and OSX 
> sdk
> ---
>
> Key: ARROW-7130
> URL: https://issues.apache.org/jira/browse/ARROW-7130
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Gandiva
>Reporter: Krisztian Szucs
>Priority: Major
>
> ARROW_GANDIVA_PC_CXX_FLAGS requires special treatment based on the platforms, 
> see:
> - https://github.com/apache/arrow/blob/master/ci/scripts/cpp_build.sh#L27-L32
> - 
> https://github.com/conda-forge/arrow-cpp-feedstock/blob/master/recipe/build.sh#L12-L15
> We should integrate this logic into CMake by default.





[jira] [Updated] (ARROW-7190) [CI][Crossbow] Add extra testing groups for on-demand builds

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-7190:

Fix Version/s: (was: 0.16.0)

> [CI][Crossbow] Add extra testing groups for on-demand builds
> 
>
> Key: ARROW-7190
> URL: https://issues.apache.org/jira/browse/ARROW-7190
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Developer Tools
>Reporter: Neal Richardson
>Priority: Minor
>
> We should have an easy way to trigger an extra set of builds but not the full 
> matrix that runs nightly. For example, we should have a python group that 
> runs one conda osx build, one wheel build, etc., not across all versions. 
> Most of the time when we have a nightly build failure on python packaging, we 
> get 5 failures for the same error, so we should be able to detect these more 
> cheaply.
> We already have build groups in crossbow, so this means adding new groups.





[jira] [Updated] (ARROW-7179) [C++][Compute] Coalesce kernel

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-7179:

Fix Version/s: (was: 0.16.0)

> [C++][Compute] Coalesce kernel
> --
>
> Key: ARROW-7179
> URL: https://issues.apache.org/jira/browse/ARROW-7179
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.15.1
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
>
> Add a kernel which replaces null values in an array with a scalar value or 
> with values taken from another array:
> {code}
> coalesce([1, 2, null, 3], 5) -> [1, 2, 5, 3]
> coalesce([1, null, null, 3], [5, 6, null, 8]) -> [1, 6, null, 3]
> {code}
> The code in {{take_internal.h}} should be of some use with a bit of 
> refactoring.
> A filter Expression should be added at the same time.
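The proposed semantics can be sketched in plain Python, with None standing in for null (an illustration of the contract only, not the C++ kernel):

```python
# Plain-Python sketch of the proposed coalesce semantics; None stands in
# for null. Illustrative only -- not the actual Arrow C++ kernel.
def coalesce(values, fill):
    # `fill` may be a scalar or a second sequence of equal length.
    if isinstance(fill, (list, tuple)):
        return [f if v is None else v for v, f in zip(values, fill)]
    return [fill if v is None else v for v in values]

print(coalesce([1, 2, None, 3], 5))                   # [1, 2, 5, 3]
print(coalesce([1, None, None, 3], [5, 6, None, 8]))  # [1, 6, None, 3]
```

Note that a null in the fill array leaves the slot null, matching the second example in the description.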





[jira] [Updated] (ARROW-7151) [C++][Dataset] Refactor ExpressionEvaluator to yield Arrays

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-7151:

Fix Version/s: (was: 0.16.0)

> [C++][Dataset] Refactor ExpressionEvaluator to yield Arrays
> ---
>
> Key: ARROW-7151
> URL: https://issues.apache.org/jira/browse/ARROW-7151
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, C++ - Dataset
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
>
> Currently expressions can be evaluated to scalars or arrays, mostly to 
> accommodate ScalarExpression. Instead let all expressions be evaluable to 
> Array only. ScalarExpression will evaluate to an array of repeated values, 
> but expressions whose corresponding kernels can accept a scalar directly 
> (comparison, for example) can avoid materializing this array.





[jira] [Updated] (ARROW-7122) [CI][Documentation] docker-compose developer guide in the sphinx documentation

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-7122:

Fix Version/s: (was: 0.16.0)

> [CI][Documentation] docker-compose developer guide in the sphinx documentation
> -
>
> Key: ARROW-7122
> URL: https://issues.apache.org/jira/browse/ARROW-7122
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>
> We have a short guide in the sphinx documentation under integration.rst
> It needs to be updated with the recent docker-compose changes.





[jira] [Commented] (ARROW-7128) [CI] Fedora cron jobs are failing because of wrong fedora version

2020-01-07 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010207#comment-17010207
 ] 

Wes McKinney commented on ARROW-7128:
-

Is this resolved?

> [CI] Fedora cron jobs are failing because of wrong fedora version
> -
>
> Key: ARROW-7128
> URL: https://issues.apache.org/jira/browse/ARROW-7128
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.16.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> The requested fedora version is 10 (Debian) instead of 29: 
> https://github.com/apache/arrow/runs/299223601





[jira] [Commented] (ARROW-7121) [C++][CI][Windows] Enable more features on the windows GHA build

2020-01-07 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010206#comment-17010206
 ] 

Wes McKinney commented on ARROW-7121:
-

[~kszucs] this is a significant blind spot relative to Travis CI. Can we fix 
this before releasing?

> [C++][CI][Windows] Enable more features on the windows GHA build
> 
>
> Key: ARROW-7121
> URL: https://issues.apache.org/jira/browse/ARROW-7121
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Continuous Integration
>Reporter: Krisztian Szucs
>Priority: Major
> Fix For: 0.16.0
>
>
> Like `ARROW_GANDIVA: ON`, `ARROW_FLIGHT: ON`, `ARROW_PARQUET: ON`





[jira] [Updated] (ARROW-7093) [R] Support creating ScalarExpressions for more data types

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-7093:

Fix Version/s: (was: 0.16.0)

> [R] Support creating ScalarExpressions for more data types
> --
>
> Key: ARROW-7093
> URL: https://issues.apache.org/jira/browse/ARROW-7093
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Neal Richardson
>Assignee: Romain Francois
>Priority: Critical
>
> See 
> https://github.com/apache/arrow/blob/master/r/src/expression.cpp#L93-L107. 
> ARROW-6340 was limited to integer/double/logical. This will let us make 
> dataset filter expressions with all those other types.





[jira] [Updated] (ARROW-7075) [C++] Boolean kernels should not allocate in Call()

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-7075:

Fix Version/s: (was: 0.16.0)

> [C++] Boolean kernels should not allocate in Call()
> ---
>
> Key: ARROW-7075
> URL: https://issues.apache.org/jira/browse/ARROW-7075
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, C++ - Compute
>Affects Versions: 0.15.0
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
>
> The boolean kernels currently allocate their value buffers ahead of time but 
> not their null bitmaps.





[jira] [Updated] (ARROW-7071) [Python] Add Array convenience method to create "masked" view with different validity bitmap

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-7071:

Fix Version/s: (was: 0.16.0)

> [Python] Add Array convenience method to create "masked" view with different 
> validity bitmap
> 
>
> Key: ARROW-7071
> URL: https://issues.apache.org/jira/browse/ARROW-7071
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>
> NB: I'm not sure what kind of pitfalls there might be when replacing an 
> existing validity bitmap and exposing some previously-null values
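The idea, and the pitfall the description mentions, can be sketched in plain Python (a hypothetical toy model where an "array" is a values list plus a validity list, not the real Arrow Array):

```python
# Toy model (hypothetical) of a "masked" view: the new array shares the
# original value buffer and only swaps in a different validity bitmap.
class SimpleArray:
    def __init__(self, values, validity):
        self.values = values      # shared, never copied
        self.validity = validity  # True = valid, False = null

    def with_validity(self, new_validity):
        # The "masked" view: same values, new validity bitmap.
        return SimpleArray(self.values, new_validity)

    def to_pylist(self):
        return [v if ok else None for v, ok in zip(self.values, self.validity)]

arr = SimpleArray([1, 2, 3, 4], [True, True, False, True])
masked = arr.with_validity([True, False, True, True])
print(masked.to_pylist())  # [1, None, 3, 4] -- the previously-null slot 2 is now exposed
```

The last line shows the concern from the description: replacing the bitmap can expose a value that was previously null and whose contents may be undefined.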





[jira] [Updated] (ARROW-7114) [JS][CI] NodeJS build fails on Github Actions Windows node

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-7114:

Fix Version/s: (was: 0.16.0)

> [JS][CI] NodeJS build fails on Github Actions Windows node
> --
>
> Key: ARROW-7114
> URL: https://issues.apache.org/jira/browse/ARROW-7114
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, JavaScript
>Reporter: Krisztian Szucs
>Priority: Major
>
> I had an attempt to install 
> [cross-env|https://github.com/apache/arrow/blob/master/.github/workflows/js.yml#L108]
>  as suggested by [~paul.e.taylor] but I guess it requires a bit more work.
> {code:java}
> > NODE_NO_WARNINGS=1 gulp build
> # 'NODE_NO_WARNINGS' is not recognized as an internal or external command,
> # operable program or batch file.
> # npm ERR! code ELIFECYCLE
> # npm ERR! errno 1
> # npm ERR! apache-arrow@1.0.0-SNAPSHOT build: `NODE_NO_WARNINGS=1 gulp build`
> # npm ERR! Exit status 1
> # npm ERR!
> # npm ERR! Failed at the apache-arrow@1.0.0-SNAPSHOT build script.
> # npm ERR! This is probably not a problem with npm. There is likely 
> additional logging output above. {code}





[jira] [Commented] (ARROW-7032) [Release] Verify python wheels in the release verification script

2020-01-07 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010204#comment-17010204
 ] 

Wes McKinney commented on ARROW-7032:
-

I'm not sure using virtualenv is practical because we need to use different 
versions of Python. Thoughts?

> [Release] Verify python wheels in the release verification script
> -
>
> Key: ARROW-7032
> URL: https://issues.apache.org/jira/browse/ARROW-7032
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Packaging
>Reporter: Krisztian Szucs
>Priority: Major
> Fix For: 0.16.0
>
>
> For linux wheels use docker, otherwise setup a virtualenv and install the 
> wheel supported on the host's platform. 
> Testing should include the imports for the optional modules and perhaps 
> running the unit tests, but the import testing should catch most of the wheel 
> issues.





[jira] [Updated] (ARROW-7051) [C++] Improve MakeArrayOfNull to support creation of multiple arrays

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-7051:

Fix Version/s: (was: 0.16.0)

> [C++] Improve MakeArrayOfNull to support creation of multiple arrays
> 
>
> Key: ARROW-7051
> URL: https://issues.apache.org/jira/browse/ARROW-7051
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.14.0
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Minor
>
> MakeArrayOfNull reuses a single buffer of {{0}} for all buffers in the array 
> it creates. It could be extended to reuse that same buffer for all buffers in 
> multiple arrays. This optimization will make RecordBatchProjector and 
> ConcatenateTablesWithPromotion more memory-efficient.
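The buffer-sharing idea can be sketched in plain Python (a toy model, not the C++ implementation):

```python
# Toy sketch of the proposed optimization: one shared zero-filled buffer
# backs the buffers of several all-null "arrays" at once, instead of a
# fresh zeroed allocation per array.
shared_zeros = bytes(64)  # a single zeroed allocation

def make_null_array(length):
    # Both the validity bitmap and the value buffer point at the same
    # zeros; an all-zero validity bitmap marks every slot null.
    return {"length": length, "validity": shared_zeros, "values": shared_zeros}

a = make_null_array(4)
b = make_null_array(8)
assert a["validity"] is b["validity"]  # memory is shared, not duplicated
```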





[jira] [Updated] (ARROW-6940) [C++] Expose Message-level IPC metadata in both read and write interfaces

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6940:

Fix Version/s: (was: 0.16.0)
   1.0.0

> [C++] Expose Message-level IPC metadata in both read and write interfaces
> -
>
> Key: ARROW-6940
> URL: https://issues.apache.org/jira/browse/ARROW-6940
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> the Message flatbuffer type has {{custom_metadata}} but there is no API 
> support for reading and writing values to this field. 





[jira] [Updated] (ARROW-7010) [C++] Support lossy casts from decimal128 to float32 and float64/double

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-7010:

Fix Version/s: (was: 0.16.0)

> [C++] Support lossy casts from decimal128 to float32 and float64/double
> ---
>
> Key: ARROW-7010
> URL: https://issues.apache.org/jira/browse/ARROW-7010
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> I do not believe such casts are implemented. This can be helpful for people 
> analyzing data where the precision of decimal128 is not needed
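Why such a cast is necessarily lossy can be seen with Python's stdlib decimal type: float64 carries roughly 15-17 significant decimal digits, while decimal128 values can have up to 38 digits of precision.

```python
# Illustration of the precision loss inherent in decimal -> float64 casts,
# using Python's stdlib Decimal as a stand-in for Arrow's decimal128.
from decimal import Decimal

d = Decimal("12345678901234567890.123456789012345678")  # 38 significant digits
f = float(d)  # the cast keeps only ~16 significant digits
assert Decimal(f) != d  # round-tripping does not recover the original value
```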





[jira] [Updated] (ARROW-6982) [R] Add bindings for compare and boolean kernels

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6982:

Fix Version/s: (was: 0.16.0)

> [R] Add bindings for compare and boolean kernels
> 
>
> Key: ARROW-6982
> URL: https://issues.apache.org/jira/browse/ARROW-6982
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Romain Francois
>Priority: Major
>
> See cpp/src/arrow/compute/kernels/compare.h and boolean.h. ARROW-6980 
> introduces an Expression class that works on Arrow Arrays, but to evaluate 
> the expressions, it has to pull the data into R first. This would enable us 
> to do the work in C++ and only pull in the result.





[jira] [Updated] (ARROW-6959) [C++] Clarify what signatures are preferred for compute kernels

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6959:

Fix Version/s: (was: 0.16.0)

> [C++] Clarify what signatures are preferred for compute kernels
> ---
>
> Key: ARROW-6959
> URL: https://issues.apache.org/jira/browse/ARROW-6959
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Ben Kietzman
>Priority: Minor
>
> Many of the compute kernels feature functions which accept only array inputs 
> in addition to functions which accept Datums. The former seems implicitly 
> like a convenience wrapper around the latter but I don't think this is 
> explicit anywhere. Is there a preferred overload for bindings to use? Is it 
> preferred that C++ implementers provide convenience wrappers for different 
> permutations of argument type? (for example, Filter now provides an overload 
> for record batch input as well as array input)





[jira] [Updated] (ARROW-6978) [R] Add bindings for sum and mean compute kernels

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6978:

Fix Version/s: (was: 0.16.0)

> [R] Add bindings for sum and mean compute kernels
> -
>
> Key: ARROW-6978
> URL: https://issues.apache.org/jira/browse/ARROW-6978
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Neal Richardson
>Assignee: Romain Francois
>Priority: Major
>






[jira] [Updated] (ARROW-6945) [Rust] Enable integration tests

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6945:

Fix Version/s: (was: 0.16.0)
   1.0.0

> [Rust] Enable integration tests
> ---
>
> Key: ARROW-6945
> URL: https://issues.apache.org/jira/browse/ARROW-6945
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 1.0.0
>
>
> Use docker-compose to generate test files using the Java implementation and 
> then have Rust tests read them.





[jira] [Commented] (ARROW-6917) [Developer] Implement Python script to generate git cherry-pick commands needed to create patch build branch for maint releases

2020-01-07 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010202#comment-17010202
 ] 

Wes McKinney commented on ARROW-6917:
-

Do we want to add this to the repo?

> [Developer] Implement Python script to generate git cherry-pick commands 
> needed to create patch build branch for maint releases
> ---
>
> Key: ARROW-6917
> URL: https://issues.apache.org/jira/browse/ARROW-6917
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.16.0
>
>
> For 0.14.1, I maintained this script by hand. It would be less failure-prone 
> (maybe) to generate it based on the fix versions set in JIRA
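A minimal sketch of such a generator (the names and inputs are hypothetical; the real script would query JIRA for issues carrying the maintenance fix version):

```python
# Hypothetical sketch: given commits on master as (sha, issue) pairs in
# commit order, and the set of issues tagged with the maintenance fix
# version in JIRA, emit the cherry-pick commands for the patch branch.
def cherry_pick_commands(commits, maint_issues):
    return [f"git cherry-pick {sha}"
            for sha, issue in commits
            if issue in maint_issues]

cmds = cherry_pick_commands(
    [("abc1234", "ARROW-100"), ("def5678", "ARROW-200"), ("0123abc", "ARROW-300")],
    {"ARROW-200", "ARROW-300"},
)
print(cmds)  # ['git cherry-pick def5678', 'git cherry-pick 0123abc']
```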





[jira] [Commented] (ARROW-6895) [C++][Parquet] parquet::arrow::ColumnReader: ByteArrayDictionaryRecordReader repeats returned values when calling `NextBatch()`

2020-01-07 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010200#comment-17010200
 ] 

Wes McKinney commented on ARROW-6895:
-

Was this never fixed? I must have gotten sidetracked. Leaving in 0.16.0

> [C++][Parquet] parquet::arrow::ColumnReader: ByteArrayDictionaryRecordReader 
> repeats returned values when calling `NextBatch()`
> ---
>
> Key: ARROW-6895
> URL: https://issues.apache.org/jira/browse/ARROW-6895
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.15.0
> Environment: Linux 5.2.17-200.fc30.x86_64 (Docker)
>Reporter: Adam Hooper
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.16.0
>
> Attachments: bad.parquet, reset-dictionary-on-read.diff, works.parquet
>
>
> Given most columns, I can run a loop like:
> {code:cpp}
> std::unique_ptr<parquet::arrow::ColumnReader> columnReader(/*...*/);
> while (nRowsRemaining > 0) {
>   int n = std::min(100, nRowsRemaining);
>   std::shared_ptr<arrow::ChunkedArray> chunkedArray;
>   auto status = columnReader->NextBatch(n, &chunkedArray);
>   // ... and then use `chunkedArray`
>   nRowsRemaining -= n;
> }
> {code}
> (The context is: "convert Parquet to CSV/JSON, with small memory footprint." 
> Used in https://github.com/CJWorkbench/parquet-to-arrow)
> Normally, the first {{NextBatch()}} return value looks like {{val0...val99}}; 
> the second return value looks like {{val100...val199}}; and so on.
> ... but with a {{ByteArrayDictionaryRecordReader}}, that isn't the case. The 
> first {{NextBatch()}} return value looks like {{val0...val100}}; the second 
> return value looks like {{val0...val99, val100...val199}} (ChunkedArray with 
> two arrays); the third return value looks like {{val0...val99, 
> val100...val199, val200...val299}} (ChunkedArray with three arrays); and so 
> on. The returned arrays are never cleared.
> In sum: {{NextBatch()}} on a dictionary column reader returns the wrong 
> values.
> I've attached a minimal Parquet file that presents this problem with the 
> above code; and I've written a patch that fixes this one case, to illustrate 
> where things are wrong. I don't think I understand enough edge cases to 
> decree that my patch is a correct fix.
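The reported behavior can be reproduced in a pure-Python simulation (illustrative only, not the C++ reader): a reader that appends each decoded chunk to an internal list and never clears it between batches.

```python
# Pure-Python simulation of the reported bug: the record reader appends
# decoded chunks but never resets the list, so every next_batch() call
# also returns all previously returned chunks.
class BuggyDictionaryRecordReader:
    def __init__(self, values):
        self.values = values
        self.pos = 0
        self.chunks = []  # the bug: never cleared between batches

    def next_batch(self, n):
        self.chunks.append(self.values[self.pos:self.pos + n])
        self.pos += n
        return list(self.chunks)  # a "ChunkedArray" of ALL chunks so far

reader = BuggyDictionaryRecordReader(list(range(6)))
print(reader.next_batch(3))  # [[0, 1, 2]]             -- as expected
print(reader.next_batch(3))  # [[0, 1, 2], [3, 4, 5]]  -- wrong: first chunk repeated
```

The fix sketched in the attached diff amounts to resetting the accumulated chunks at the start of each read.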





[jira] [Updated] (ARROW-6890) [Rust] [Parquet] ArrowReader fails with seg fault

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6890:

Fix Version/s: (was: 0.16.0)

> [Rust] [Parquet] ArrowReader fails with seg fault
> -
>
> Key: ARROW-6890
> URL: https://issues.apache.org/jira/browse/ARROW-6890
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Affects Versions: 0.16.0
>Reporter: Andy Grove
>Assignee: Renjie Liu
>Priority: Major
>
> ArrowReader fails with seg fault when trying to read an unsupported type, 
> like Utf8. We should have it return an Err instead of causing a segmentation 
> fault.
>  
> See [https://github.com/apache/arrow/pull/5641] for a reproducible test.





[jira] [Updated] (ARROW-6892) [Rust] [DataFusion] Implement optimizer rule to remove redundant projections

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6892:

Fix Version/s: (was: 0.16.0)

> [Rust] [DataFusion] Implement optimizer rule to remove redundant projections
> 
>
> Key: ARROW-6892
> URL: https://issues.apache.org/jira/browse/ARROW-6892
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Priority: Minor
>
> Currently we have code in the SQL query planner that wraps aggregate queries 
> in a projection (if needed) to preserve the order of the final results. This 
> is needed because the aggregate query execution always returns a result with 
> grouping expressions first and then aggregate expressions.
> It would be better (simpler, more readable code) to always wrap aggregates in 
> projections and have an optimizer rule to remove redundant projections. There 
> are likely other use cases where redundant projections might exist too.
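The proposed rule can be sketched as follows (a toy model with plans as nested dicts, not DataFusion code): a projection is redundant when it selects exactly its child's output columns in order, so the rule replaces it with the child plan.

```python
# Toy sketch (not DataFusion code) of removing a redundant projection
# from a logical plan modeled as nested dicts.
def remove_redundant_projection(plan):
    if plan.get("op") == "projection":
        child = remove_redundant_projection(plan["input"])
        if plan["exprs"] == child.get("schema"):
            return child  # drop the no-op projection
        return {**plan, "input": child}
    return plan

scan = {"op": "scan", "schema": ["a", "b"]}
wrapped = {"op": "projection", "exprs": ["a", "b"], "input": scan}
assert remove_redundant_projection(wrapped) == scan
```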





[jira] [Updated] (ARROW-6875) [Python][Flight] Implement Criteria for ListFlights RPC / list_flights method

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6875:

Fix Version/s: (was: 0.16.0)

> [Python][Flight] Implement Criteria for ListFlights RPC / list_flights method
> -
>
> Key: ARROW-6875
> URL: https://issues.apache.org/jira/browse/ARROW-6875
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>
> We should work through how to pass a custom Criteria to ListFlights





[jira] [Updated] (ARROW-6883) [C++] Support sending delta DictionaryBatch or replacement DictionaryBatch in IPC stream writer class

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6883:

Fix Version/s: (was: 0.16.0)
   1.0.0

> [C++] Support sending delta DictionaryBatch or replacement DictionaryBatch in 
> IPC stream writer class
> -
>
> Key: ARROW-6883
> URL: https://issues.apache.org/jira/browse/ARROW-6883
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Blocker
> Fix For: 1.0.0
>
>
> I didn't see other JIRA issues about this, but this is one significant matter 
> to have complete columnar format coverage in the C++ library.
> This functionality will flow through to the various bindings, so it would be 
> helpful to add unit tests to assert that things work correctly e.g. in Python 
> from an end-user perspective





[jira] [Commented] (ARROW-6841) [C++] Upgrade to LLVM 8

2020-01-07 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010198#comment-17010198
 ] 

Wes McKinney commented on ARROW-6841:
-

[~ravindra] thoughts about this?

> [C++] Upgrade to LLVM 8
> ---
>
> Key: ARROW-6841
> URL: https://issues.apache.org/jira/browse/ARROW-6841
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.16.0
>
>
> Now that LLVM 9 has been released, LLVM 8 has been promoted to stable 
> according to 
> http://apt.llvm.org/





[jira] [Updated] (ARROW-6856) [C++] Use ArrayData instead of Array for ArrayData::dictionary

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6856:

Fix Version/s: (was: 0.16.0)

> [C++] Use ArrayData instead of Array for ArrayData::dictionary
> --
>
> Key: ARROW-6856
> URL: https://issues.apache.org/jira/browse/ARROW-6856
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> This would be helpful for consistency. {{DictionaryArray}} may want to cache 
> a "boxed" version of this to return from {{DictionaryArray::dictionary}}





[jira] [Commented] (ARROW-6799) [C++] Plasma JNI component links to flatbuffers::flatbuffers (unnecessarily?)

2020-01-07 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010196#comment-17010196
 ] 

Wes McKinney commented on ARROW-6799:
-

If it's not being maintained then I agree we should delete it

> [C++] Plasma JNI component links to flatbuffers::flatbuffers (unnecessarily?)
> -
>
> Key: ARROW-6799
> URL: https://issues.apache.org/jira/browse/ARROW-6799
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Java
>Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.16.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Does not appear to be tested in CI. Originally reported at 
> https://github.com/apache/arrow/issues/5575





[jira] [Commented] (ARROW-6821) [C++][Parquet] Do not require Thrift compiler when building (but still require library)

2020-01-07 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010197#comment-17010197
 ] 

Wes McKinney commented on ARROW-6821:
-

cc [~npr]

> [C++][Parquet] Do not require Thrift compiler when building (but still 
> require library)
> ---
>
> Key: ARROW-6821
> URL: https://issues.apache.org/jira/browse/ARROW-6821
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.16.0
>
>
> Building Thrift from source carries extra toolchain dependencies (bison and 
> flex). If we check in the files produced by compiling parquet.thrift, then 
> the EP can be simplified to only build the Thrift C++ library and not the 
> compiler. This also results in a simpler build for third parties





[jira] [Updated] (ARROW-6800) [C++] Add CMake option to build libraries targeting a C++14 or C++17 toolchain environment

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6800:

Fix Version/s: (was: 0.16.0)
   1.0.0

> [C++] Add CMake option to build libraries targeting a C++14 or C++17 
> toolchain environment
> --
>
> Key: ARROW-6800
> URL: https://issues.apache.org/jira/browse/ARROW-6800
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> Such option would cause public APIs involving e.g. {{string_view}} to use the 
> STL versions rather than our vendored backports





[jira] [Commented] (ARROW-6788) [CI] Migrate Travis CI lint job to GitHub Actions

2020-01-07 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010195#comment-17010195
 ] 

Wes McKinney commented on ARROW-6788:
-

[~kszucs] can this be closed? is test_merge_arrow_pr.py being run now?

> [CI] Migrate Travis CI lint job to GitHub Actions
> -
>
> Key: ARROW-6788
> URL: https://issues.apache.org/jira/browse/ARROW-6788
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.16.0
>
>
> Depends on ARROW-5802. As far as I can tell GitHub Actions jobs run more or 
> less immediately so this will give more prompt feedback to contributors





[jira] [Updated] (ARROW-6783) [C++] Provide API for reconstruction of RecordBatch from Flatbuffer containing process memory addresses instead of relative offsets into an IPC message

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6783:

Fix Version/s: (was: 0.16.0)
   1.0.0

> [C++] Provide API for reconstruction of RecordBatch from Flatbuffer 
> containing process memory addresses instead of relative offsets into an IPC 
> message
> ---
>
> Key: ARROW-6783
> URL: https://issues.apache.org/jira/browse/ARROW-6783
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> A lot of our development has focused on _inter_process communication rather 
> than _in_process. We should start by making sure we have disassembly and 
> reassembly implemented where the Buffer Flatbuffers values contain process 
> memory addresses rather than offsets. This may require a bit of refactoring 
> so we can use the same reassembly code path for both use cases





[jira] [Commented] (ARROW-6759) [JS] Run less comprehensive every-commit build, relegate multi-target builds perhaps to nightlies

2020-01-07 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010194#comment-17010194
 ] 

Wes McKinney commented on ARROW-6759:
-

GHA is taking about 25 minutes, but since it runs more promptly there seems to 
be less urgency to address this for now

> [JS] Run less comprehensive every-commit build, relegate multi-target builds 
> perhaps to nightlies
> -
>
> Key: ARROW-6759
> URL: https://issues.apache.org/jira/browse/ARROW-6759
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: JavaScript
>Reporter: Wes McKinney
>Priority: Major
>
> The JavaScript CI build is taking 25-30 minutes nowadays. This could be 
> abbreviated by testing fewer deployment targets. We obviously still need to 
> test all the deployment targets but we could do that nightly instead of on 
> every commit





[jira] [Updated] (ARROW-6753) [Release] Document environment configuration to run release verification on macOS

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6753:

Fix Version/s: (was: 0.16.0)

> [Release] Document environment configuration to run release verification on 
> macOS
> -
>
> Key: ARROW-6753
> URL: https://issues.apache.org/jira/browse/ARROW-6753
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools
>Reporter: Wes McKinney
>Priority: Major
>
> Since I don't use macOS as a primary OS I don't have all-the-things set up 
> for usual Arrow development. A guide for Homebrew users to be able to run the 
> release verification starting from very little pre-installed beyond Xcode 
> would be nice





[jira] [Updated] (ARROW-6720) [JAVA][C++]Support Parquet Read and Write in Java

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6720:

Fix Version/s: (was: 0.16.0)
   1.0.0

> [JAVA][C++]Support Parquet Read and Write in Java
> -
>
> Key: ARROW-6720
> URL: https://issues.apache.org/jira/browse/ARROW-6720
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Java
>Affects Versions: 0.15.0
>Reporter: Chendi.Xue
>Assignee: Chendi.Xue
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 37h 10m
>  Remaining Estimate: 0h
>
> We added a new Java interface to support Parquet read and write from HDFS or 
> local files.
> The motivation is that when loading and dumping Parquet data in Java, we can 
> currently only use row-based put and get methods. Since Arrow already has a 
> C++ implementation to load and dump Parquet, we wrapped that code as Java 
> APIs.
> In our tests we noticed that, for our workload, performance improved more 
> than 2x compared with row-based load and dump, so we want to contribute the 
> code to Arrow.
> Since this is a completely independent change, no existing Arrow code is 
> modified. We added two folders as listed: java/adapter/parquet and 
> cpp/src/jni/parquet





[jira] [Updated] (ARROW-6697) [Rust] [DataFusion] Validate that all parquet partitions have the same schema

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6697:

Fix Version/s: (was: 0.16.0)

> [Rust] [DataFusion] Validate that all parquet partitions have the same schema
> -
>
> Key: ARROW-6697
> URL: https://issues.apache.org/jira/browse/ARROW-6697
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Priority: Major
>
> When reading a partitioned Parquet file in DataFusion, the schema is read 
> from the first partition and it is assumed that all other partitions have the 
> same schema.
> It would be better to actually validate that all of the partitions have the 
> same schema since there is no support for schema merging yet.
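The proposed check can be sketched in plain Python. The pair-of-tuples schema model and the function name below are illustrative assumptions, not the DataFusion API, which would compare the actual parquet schema types:

```python
# Hypothetical sketch of the proposed validation: until schema merging
# exists, every partition's schema must match the first partition's
# schema exactly. Schemas are modeled here as tuples of
# (column_name, type_name) pairs.

def validate_partition_schemas(partition_schemas):
    """Raise ValueError if any partition schema differs from the first."""
    if not partition_schemas:
        return
    reference = partition_schemas[0]
    for i, schema in enumerate(partition_schemas[1:], start=1):
        if schema != reference:
            raise ValueError(
                f"partition {i} schema {schema!r} does not match "
                f"partition 0 schema {reference!r}"
            )

# Matching partitions pass silently; any mismatch raises.
validate_partition_schemas([
    (("id", "INT64"), ("name", "BYTE_ARRAY")),
    (("id", "INT64"), ("name", "BYTE_ARRAY")),
])
```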





[jira] [Updated] (ARROW-6699) [C++] Add Parquet docs

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6699:

Fix Version/s: (was: 0.16.0)

> [C++] Add Parquet docs
> --
>
> Key: ARROW-6699
> URL: https://issues.apache.org/jira/browse/ARROW-6699
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Antoine Pitrou
>Priority: Major
>
> There is currently zero Sphinx doc for Parquet. I'm adding a stub in 
> ARROW-6630 but we should do more, especially as Arrow benefits from tight 
> integration with Parquet.





[jira] [Updated] (ARROW-6738) [Java] Fix problems with current union comparison logic

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6738:

Fix Version/s: (was: 0.16.0)
   1.0.0

> [Java] Fix problems with current union comparison logic
> ---
>
> Key: ARROW-6738
> URL: https://issues.apache.org/jira/browse/ARROW-6738
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Liya Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 6h
>  Remaining Estimate: 0h
>
> There are some problems with the current union comparison logic. For example:
> 1. For the type check, we should not require fields to be equal. It is 
> possible that two vectors' value ranges are equal even though their fields 
> differ.
> 2. We should not compare the number of sub-vectors, as it is possible that 
> two union vectors have different numbers of sub-vectors but equal values in 
> the range.





[jira] [Updated] (ARROW-6759) [JS] Run less comprehensive every-commit build, relegate multi-target builds perhaps to nightlies

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6759:

Fix Version/s: (was: 0.16.0)

> [JS] Run less comprehensive every-commit build, relegate multi-target builds 
> perhaps to nightlies
> -
>
> Key: ARROW-6759
> URL: https://issues.apache.org/jira/browse/ARROW-6759
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: JavaScript
>Reporter: Wes McKinney
>Priority: Major
>
> The JavaScript CI build is taking 25-30 minutes nowadays. This could be 
> shortened by testing fewer deployment targets on every commit. We still need 
> to test all the deployment targets, but we could do that nightly instead.





[jira] [Updated] (ARROW-6691) [Rust] [DataFusion] Use tokio and Futures instead of spawning threads

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6691:

Fix Version/s: (was: 0.16.0)
   1.0.0

> [Rust] [DataFusion] Use tokio and Futures instead of spawning threads
> -
>
> Key: ARROW-6691
> URL: https://issues.apache.org/jira/browse/ARROW-6691
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Priority: Major
> Fix For: 1.0.0
>
> Attachments: image-2019-12-07-17-54-57-862.png
>
>
> The current implementation of the physical query plan uses "thread::spawn", 
> which is expensive. We should switch to Futures, async!/await!, and tokio so 
> that tasks are launched in a thread pool instead, writing idiomatic Rust 
> code with futures combinators to chain actions together.





[jira] [Updated] (ARROW-6689) [Rust] [DataFusion] Query execution enhancements for 1.0.0 release

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6689:

Fix Version/s: (was: 0.16.0)
   1.0.0

> [Rust] [DataFusion] Query execution enhancements for 1.0.0 release
> --
>
> Key: ARROW-6689
> URL: https://issues.apache.org/jira/browse/ARROW-6689
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 1.0.0
>
>
> There a number of optimizations that can be made to the new query execution 
> and this is a top level story to track them all.





[jira] [Updated] (ARROW-6680) [Python] Add Array ctor microbenchmarks

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6680:

Fix Version/s: (was: 0.16.0)

> [Python] Add Array ctor microbenchmarks
> ---
>
> Key: ARROW-6680
> URL: https://issues.apache.org/jira/browse/ARROW-6680
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>
> Microbenchmarks would help track the cost of the additional unavoidable 
> validation being added in, e.g., https://github.com/apache/arrow/pull/5488





[jira] [Updated] (ARROW-6076) [C++][Parquet] RecordReader::Reset logic is inefficient for small reads

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6076:

Fix Version/s: (was: 0.16.0)

> [C++][Parquet] RecordReader::Reset logic is inefficient for small reads
> ---
>
> Key: ARROW-6076
> URL: https://issues.apache.org/jira/browse/ARROW-6076
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> We have a unit test 
> https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/arrow-reader-writer-test.cc#L933
> that reads 1 record at a time from a Parquet-Arrow column reader. There is 
> logic on RecordReader that advances the definition/repetition levels based on 
> consumed data from previous records, but this is inefficient for this case:
> https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_reader.cc#L1011
> This should be refactored to not require this copying, or at least to only 
> "shift" the levels occasionally 





[jira] [Updated] (ARROW-6103) [Java] Do we really want to use the maven release plugin?

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6103:

Fix Version/s: (was: 0.16.0)

> [Java] Do we really want to use the maven release plugin?
> -
>
> Key: ARROW-6103
> URL: https://issues.apache.org/jira/browse/ARROW-6103
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools, Java
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>
> For reference .. I'm filing this issue to track investigation work around 
> this ..
> {code:java}
> The biggest problem for the Git commit is our Java package
> requires "apache-arrow-${VERSION}" tag on
> https://github.com/apache/arrow . (Right?)
> I think that "mvn release:perform" in
> dev/release/01-perform.sh does so but I don't know the
> details of "mvn release:perform"...{code}





[jira] [Updated] (ARROW-6071) [C++] Implement casting Binary <-> LargeBinary

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6071:

Fix Version/s: (was: 0.16.0)

> [C++] Implement casting Binary <-> LargeBinary
> --
>
> Key: ARROW-6071
> URL: https://issues.apache.org/jira/browse/ARROW-6071
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Antoine Pitrou
>Priority: Minor
>
> We should implement bidirectional casts between Binary and LargeBinary, and 
> likewise between String and LargeString.
> In the narrowing direction, the offset width should be checked.
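The narrowing check can be sketched outside of Arrow. This is an illustration of the idea, not the Arrow C++ cast kernel: LargeBinary stores 64-bit value offsets, Binary stores 32-bit ones, so before narrowing every offset must fit in a signed 32-bit integer.

```python
# Sketch of the narrowing check for a LargeBinary -> Binary cast.
# Offsets are monotonically non-decreasing, so checking the last
# offset is sufficient to detect overflow.

INT32_MAX = 2**31 - 1

def narrow_offsets(offsets64):
    """Convert 64-bit offsets to 32-bit, failing if any would overflow."""
    if offsets64 and offsets64[-1] > INT32_MAX:
        raise OverflowError("offsets exceed int32 range; cast would truncate")
    return [int(o) for o in offsets64]

# Small offsets narrow without error.
print(narrow_offsets([0, 3, 7, 12]))
```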





[jira] [Updated] (ARROW-6072) [C++] Implement casting List <-> LargeList

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6072:

Fix Version/s: (was: 0.16.0)

> [C++] Implement casting List <-> LargeList
> --
>
> Key: ARROW-6072
> URL: https://issues.apache.org/jira/browse/ARROW-6072
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Antoine Pitrou
>Priority: Minor
>
> We should implement bidirectional casts from List to LargeList and vice-versa.
> In the narrowing direction, the offset width should be checked.





[jira] [Commented] (ARROW-6055) [C++] Refactor arrow/io/hdfs.h to use common FileSystem API

2020-01-07 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010186#comment-17010186
 ] 

Wes McKinney commented on ARROW-6055:
-

Where do things stand on this?

> [C++] Refactor arrow/io/hdfs.h to use common FileSystem API
> ---
>
> Key: ARROW-6055
> URL: https://issues.apache.org/jira/browse/ARROW-6055
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.16.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> As part of this refactor, the FileSystem-related classes in 
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/interfaces.h#L51 
> should be removed. The files should probably be moved also to arrow/filesystem





[jira] [Updated] (ARROW-6064) [FlightRPC] [C++] Clean up IWYU

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6064:

Fix Version/s: (was: 0.16.0)

> [FlightRPC] [C++] Clean up IWYU
> ---
>
> Key: ARROW-6064
> URL: https://issues.apache.org/jira/browse/ARROW-6064
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, FlightRPC
>Reporter: David Li
>Priority: Major
>
> As reported by Wes 
> https://gist.github.com/wesm/af59c7cc8f35c6fd806b0d041b816da8





[jira] [Updated] (ARROW-6052) [C++] Divide up arrow/array.h,cc into files in arrow/array/ similar to builder files

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6052:

Fix Version/s: (was: 0.16.0)

> [C++] Divide up arrow/array.h,cc into files in arrow/array/ similar to 
> builder files
> 
>
> Key: ARROW-6052
> URL: https://issues.apache.org/jira/browse/ARROW-6052
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> Since these files are getting larger, this would improve codebase 
> navigability. Probably should use the same naming scheme as builder_* e.g. 
> {{arrow/array/array_dict.h}}
> I recommend also putting the unit test files related to these in there for 
> better semantic organization. 





[jira] [Closed] (ARROW-5982) [C++] Add methods to append dictionary values and dictionary indices directly into DictionaryBuilder

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-5982.
---
Fix Version/s: (was: 0.16.0)
   Resolution: Fixed

This was done in 
https://github.com/apache/arrow/commit/38b01764da445ce6383b60a50d1e9b313857a3d7#diff-ce752fd9d1926a96cbc426de3d32d3ca

> [C++] Add methods to append dictionary values and dictionary indices directly 
> into DictionaryBuilder
> 
>
> Key: ARROW-5982
> URL: https://issues.apache.org/jira/browse/ARROW-5982
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> In scenarios where a developer already has an array of dictionary indices 
> that reference a known dictionary, it is useful to be able to insert the 
> indices directly, circumventing the hash table lookup. The developer is then 
> responsible for keeping the indices and dictionary consistent.
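The two append paths can be contrasted with a toy model. The class and method names below are illustrative assumptions, not the Arrow C++ `DictionaryBuilder` API:

```python
# An illustrative model of the proposal: the normal path hashes each
# value to find its dictionary index, while the proposed fast path
# accepts indices directly and skips the hash table entirely.

class DictBuilder:
    def __init__(self, dictionary):
        self.dictionary = list(dictionary)
        # Hash table mapping value -> index, used by the slow path.
        self.lookup = {v: i for i, v in enumerate(self.dictionary)}
        self.indices = []

    def append_value(self, value):
        # Normal path: one hash-table lookup per value.
        self.indices.append(self.lookup[value])

    def append_indices(self, indices):
        # Proposed fast path: the caller already holds valid indices,
        # so insert them directly; consistency is the caller's problem.
        self.indices.extend(indices)

b = DictBuilder(["red", "green", "blue"])
b.append_value("blue")        # hash lookup -> index 2
b.append_indices([0, 2, 1])   # direct insertion, no lookup
print(b.indices)              # [2, 0, 2, 1]
```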





[jira] [Updated] (ARROW-7501) [C++] CMake build_thrift should build flex and bison if necessary

2020-01-07 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-7501:
---
Fix Version/s: (was: 0.16.0)

> [C++] CMake build_thrift should build flex and bison if necessary
> -
>
> Key: ARROW-7501
> URL: https://issues.apache.org/jira/browse/ARROW-7501
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Neal Richardson
>Priority: Major
>
> On MSVC and APPLE, {{build_thrift}} will handle thrift's flex and bison 
> dependencies: 
> [https://github.com/apache/arrow/blob/f578521/cpp/cmake_modules/ThirdpartyToolchain.cmake#L1052-L1097]
> But you're on your own on linux. In ARROW-6793, I wrote 100 lines of R code 
> to do this for my needs: 
> [https://github.com/apache/arrow/pull/6068/files#diff-3875fa5e75833c426b36487b25892bd8R204-R309]
> We should translate this to CMake so it's generally available.





[jira] [Updated] (ARROW-7265) [Format][C++] Clarify the usage of typeIds in Union type documentation

2020-01-07 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-7265:
---
Component/s: Format
 C++

> [Format][C++] Clarify the usage of typeIds in Union type documentation
> --
>
> Key: ARROW-7265
> URL: https://issues.apache.org/jira/browse/ARROW-7265
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Format
>Reporter: Francois Saint-Jacques
>Priority: Major
> Fix For: 0.16.0
>
>
> The documentation is unclear.





[jira] [Updated] (ARROW-7503) [Rust] Rust builds are failing on master

2020-01-07 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-7503:
---
Priority: Blocker  (was: Major)

> [Rust] Rust builds are failing on master
> 
>
> Key: ARROW-7503
> URL: https://issues.apache.org/jira/browse/ARROW-7503
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Neal Richardson
>Priority: Blocker
> Fix For: 0.16.0
>
>
> See [https://github.com/apache/arrow/runs/374130594#step:5:1506] for example:
> {code}
> ...
---- schema::types::tests::test_schema_type_thrift_conversion_err stdout ----
> thread 'schema::types::tests::test_schema_type_thrift_conversion_err' 
> panicked at 'assertion failed: `(left == right)`
>   left: `"description() is deprecated; use Display"`,
>  right: `"Root schema must be Group type"`', 
> parquet/src/schema/types.rs:1760:13
> failures:
> 
> column::writer::tests::test_column_writer_error_when_writing_disabled_dictionary
> column::writer::tests::test_column_writer_inconsistent_def_rep_length
> column::writer::tests::test_column_writer_invalid_def_levels
> column::writer::tests::test_column_writer_invalid_rep_levels
> column::writer::tests::test_column_writer_not_enough_values_to_write
> file::writer::tests::test_file_writer_error_after_close
> file::writer::tests::test_row_group_writer_error_after_close
> file::writer::tests::test_row_group_writer_error_not_all_columns_written
> file::writer::tests::test_row_group_writer_num_records_mismatch
> schema::types::tests::test_primitive_type
> schema::types::tests::test_schema_type_thrift_conversion_err
> test result: FAILED. 325 passed; 11 failed; 0 ignored; 0 measured; 0 filtered 
> out
> {code}





[jira] [Updated] (ARROW-5972) [Rust] Installing cargo-tarpaulin and generating coverage report takes over 20 minutes

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5972:

Fix Version/s: (was: 0.16.0)

> [Rust] Installing cargo-tarpaulin and generating coverage report takes over 
> 20 minutes
> --
>
> Key: ARROW-5972
> URL: https://issues.apache.org/jira/browse/ARROW-5972
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Wes McKinney
>Priority: Major
>
> See example build:
> https://travis-ci.org/apache/arrow/jobs/558986931
> Here, installing cargo-tarpaulin takes 13m32s. Running the coverage report 
> takes another 7m40s. 
> Given the Travis CI build queue issues we're having, this might be worth 
> optimizing or moving to Docker/Buildbot





[jira] [Updated] (ARROW-5981) [C++] DictionaryBuilder initialization with Array can fail silently

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5981:

Fix Version/s: (was: 0.16.0)

> [C++] DictionaryBuilder initialization with Array can fail silently
> --
>
> Key: ARROW-5981
> URL: https://issues.apache.org/jira/browse/ARROW-5981
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> See
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/builder_dict.cc#L267
> I think it would be better to expose {{InsertValues}} on 
> {{DictionaryBuilder}} and initialize from a known dictionary that way





[jira] [Commented] (ARROW-5972) [Rust] Installing cargo-tarpaulin and generating coverage report takes over 20 minutes

2020-01-07 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010179#comment-17010179
 ] 

Wes McKinney commented on ARROW-5972:
-

We aren't running the coverage report anymore. Changing to some kind of 
nightly build might be a good option.

> [Rust] Installing cargo-tarpaulin and generating coverage report takes over 
> 20 minutes
> --
>
> Key: ARROW-5972
> URL: https://issues.apache.org/jira/browse/ARROW-5972
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Wes McKinney
>Priority: Major
>
> See example build:
> https://travis-ci.org/apache/arrow/jobs/558986931
> Here, installing cargo-tarpaulin takes 13m32s. Running the coverage report 
> takes another 7m40s. 
> Given the Travis CI build queue issues we're having, this might be worth 
> optimizing or moving to Docker/Buildbot





[jira] [Updated] (ARROW-6537) [R] Pass column_types to CSV reader

2020-01-07 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-6537:
---
Fix Version/s: (was: 0.16.0)

> [R] Pass column_types to CSV reader
> ---
>
> Key: ARROW-6537
> URL: https://issues.apache.org/jira/browse/ARROW-6537
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Dataset, R
>Reporter: Neal Richardson
>Priority: Major
>  Labels: csv, dataset
>
> See also ARROW-6536. 





[jira] [Updated] (ARROW-6543) [R] Support LargeBinary and LargeString types

2020-01-07 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-6543:
---
Fix Version/s: (was: 0.16.0)
   1.0.0

> [R] Support LargeBinary and LargeString types
> -
>
> Key: ARROW-6543
> URL: https://issues.apache.org/jira/browse/ARROW-6543
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Priority: Major
> Fix For: 1.0.0
>
>
> See ARROW-750





[jira] [Assigned] (ARROW-6537) [R] Pass column_types to CSV reader

2020-01-07 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-6537:
--

Assignee: (was: Neal Richardson)

> [R] Pass column_types to CSV reader
> ---
>
> Key: ARROW-6537
> URL: https://issues.apache.org/jira/browse/ARROW-6537
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Dataset, R
>Reporter: Neal Richardson
>Priority: Major
>  Labels: csv, dataset
> Fix For: 0.16.0
>
>
> See also ARROW-6536. 





[jira] [Updated] (ARROW-5954) [Developer][Documentation] Organize source and binary dependency licenses into directories

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5954:

Fix Version/s: (was: 0.16.0)

> [Developer][Documentation] Organize source and binary dependency licenses 
> into directories
> --
>
> Key: ARROW-5954
> URL: https://issues.apache.org/jira/browse/ARROW-5954
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Developer Tools
>Reporter: Krisztian Szucs
>Priority: Major
>
> Similar to what Spark does; see this comment: 
> https://github.com/apache/arrow/pull/4880/files/b839964a2a43123991b5b291607ff1cb026fe8a4#diff-61e0bdf7e1b43c5c93d9488b22e04170





[jira] [Updated] (ARROW-5931) [C++] Extend extension types facility to provide for serialization and deserialization in IPC roundtrips

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5931:

Fix Version/s: (was: 0.16.0)

> [C++] Extend extension types facility to provide for serialization and 
> deserialization in IPC roundtrips
> 
>
> Key: ARROW-5931
> URL: https://issues.apache.org/jira/browse/ARROW-5931
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> A use case here is when an array needs to reference some external data. For 
> example, suppose that we wanted to implement an array that references a 
> sequence of Python objects as {{PyObject*}}. Obviously, a {{PyObject*}} must 
> be managed by the Python interpreter.
> For a vector of some {{T*}} to be sent through the IPC machinery, it must be 
> embedded in some Arrow type on the wire. For example, the memory-resident 
> representation of {{PyObject*}} might be 8 bytes per value (one pointer per 
> value), but when serialized to the binary IPC protocol such {{PyObject*}} 
> values must be encoded as an Arrow Binary type.





[jira] [Updated] (ARROW-5928) [JS] Test fuzzer inputs

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5928:

Fix Version/s: (was: 0.16.0)
   1.0.0

> [JS] Test fuzzer inputs
> ---
>
> Key: ARROW-5928
> URL: https://issues.apache.org/jira/browse/ARROW-5928
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: JavaScript
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> We are developing a fuzzer-based corpus of malformed IPC inputs
> https://github.com/apache/arrow-testing/tree/master/data/arrow-ipc
> The JavaScript implementation should also test against these to verify that 
> the correct kind of exception is raised





[jira] [Updated] (ARROW-5933) [C++] [Documentation] add discussion of Union.typeIds to Layout.rst

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5933:

Fix Version/s: (was: 0.16.0)

> [C++] [Documentation] add discussion of Union.typeIds to Layout.rst 
> 
>
> Key: ARROW-5933
> URL: https://issues.apache.org/jira/browse/ARROW-5933
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Documentation
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
>
> Union.typeIds is poorly documented and the corresponding property in 
> UnionType is confusingly named type_codes. In particular, Layout.rst doesn't 
> include an explanation of Union.typeIds and implies that an element of a 
> union array's type_ids buffer is always the index of a child array.





[jira] [Updated] (ARROW-5950) [Rust] [DataFusion] Add logger dependency

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5950:

Fix Version/s: (was: 0.16.0)

> [Rust] [DataFusion] Add logger dependency
> -
>
> Key: ARROW-5950
> URL: https://issues.apache.org/jira/browse/ARROW-5950
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Priority: Major
>  Labels: beginner
>
> It would be nice to be able to turn on debug logging at runtime and see how 
> query plans are built and optimized. I propose adding a dependency on the 
> log crate.





[jira] [Updated] (ARROW-5915) [C++] [Python] Set up testing for backwards compatibility of the parquet reader

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5915:

Fix Version/s: (was: 0.16.0)

> [C++] [Python] Set up testing for backwards compatibility of the parquet 
> reader
> ---
>
> Key: ARROW-5915
> URL: https://issues.apache.org/jira/browse/ARROW-5915
> Project: Apache Arrow
>  Issue Type: Test
>  Components: C++, Python
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: parquet
>
> Given the recent parquet compat problems, we should have better testing for 
> this.
> For easy testing of backwards compatibility, we could add some files (with 
> different types) written with older versions, and ensure they are read 
> correctly with the current version.
> Similarly as what Kartothek is doing: 
> https://github.com/JDASoftwareGroup/kartothek/tree/master/reference-data/arrow-compat
> An easy way would be to do that in pyarrow and add them to 
> /pyarrow/tests/data/parquet (we already have some files from 0.7 there). 
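A minimal harness for such a layout could look as follows. The directory structure (one subdirectory per writer version, as in the kartothek reference data) and the function name are assumptions for illustration; the real test would read each file with the current pyarrow and compare against expected data:

```python
# Sketch of collecting version-tagged reference files for a parquet
# backwards-compatibility test suite, e.g. root/0.7.0/ints.parquet.
import pathlib

def collect_reference_files(root):
    """Map writer version -> sorted parquet files under root/<version>/."""
    root = pathlib.Path(root)
    return {
        d.name: sorted(d.glob("*.parquet"))
        for d in sorted(root.iterdir())
        if d.is_dir()
    }
```

The compat test would then iterate over the mapping, read each file with the current reader, and assert the result matches the data the old version was asked to write.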





[jira] [Commented] (ARROW-5914) [CI] Build bundled dependencies in docker build step

2020-01-07 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010175#comment-17010175
 ] 

Wes McKinney commented on ARROW-5914:
-

[~fsaintjacques] is this still an issue?

> [CI] Build bundled dependencies in docker build step
> 
>
> Key: ARROW-5914
> URL: https://issues.apache.org/jira/browse/ARROW-5914
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration
>Reporter: Francois Saint-Jacques
>Priority: Minor
> Fix For: 0.16.0
>
>
> In the recently introduced ARROW-5803, some heavy dependencies (thrift, 
> protobuf, flatbuffers, grpc) are built at each invocation of docker-compose 
> build (thus on each Travis test).
> We should aim to build the third-party dependencies in the docker build 
> phase instead, to exploit caching and docker-compose pull, so that the CI 
> step doesn't need to build these dependencies each time.





[jira] [Resolved] (ARROW-7500) [C++][Dataset] regex_error in hive partition on centos7 and opensuse42

2020-01-07 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-7500.

Resolution: Fixed

Issue resolved by pull request 6137
[https://github.com/apache/arrow/pull/6137]

> [C++][Dataset] regex_error in hive partition on centos7 and opensuse42
> --
>
> Key: ARROW-7500
> URL: https://issues.apache.org/jira/browse/ARROW-7500
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, C++ - Dataset
>Reporter: Neal Richardson
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.16.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> See [https://github.com/apache/arrow/runs/373769666#step:5:3301] and 
> [https://github.com/apache/arrow/runs/373769676#step:5:3297]:
>  {code}
> ══ Failed 
> ══
> ── 1. Error: Hive partitioning (@test-dataset.R#89)  
> ───
> regex_error
> Backtrace:
>   1. arrow::open_dataset(...) testthat/test-dataset.R:89:2
>  12. dsd$Finish(schema)
>  15. arrow:::dataset___DSDiscovery__Finish2(self, schema)
> {code}
>  





[jira] [Updated] (ARROW-5916) [C++] Allow RecordBatch.length to be less than array lengths

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5916:

Fix Version/s: (was: 0.16.0)

> [C++] Allow RecordBatch.length to be less than array lengths
> 
>
> Key: ARROW-5916
> URL: https://issues.apache.org/jira/browse/ARROW-5916
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: John Muehlhausen
>Priority: Minor
>  Labels: pull-request-available
> Attachments: test.arrow_ipc
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> 0.13 ignored RecordBatch.length.  0.14 requires that RecordBatch.length and 
> array length be equal.  As per 
> [https://lists.apache.org/thread.html/2692dd8fe09c92aa313bded2f4c2d4240b9ef75a8604ec214eb02571@%3Cdev.arrow.apache.org%3E]
>  , we discussed changing this so that RecordBatch.length can be [0,array 
> length].
>  If RecordBatch.length is less than array length, the reader should ignore 
> the portion of the array(s) beyond RecordBatch.length.  This will allow 
> partially populated batches to be read in scenarios identified in the above 
> discussion.
> {code:c++}
>   Status GetFieldMetadata(int field_index, ArrayData* out) {
> auto nodes = metadata_->nodes();
> // pop off a field
> if (field_index >= static_cast<int>(nodes->size())) {
>   return Status::Invalid("Ran out of field metadata, likely malformed");
> }
> const flatbuf::FieldNode* node = nodes->Get(field_index);
> // out->length = node->length();    // previous behavior
> out->length = metadata_->length();  // proposed change
> out->null_count = node->null_count();
> out->offset = 0;
> return Status::OK();
>   }
> {code}
> Attached is a test IPC File containing a batch with length 1, array length 3.
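The proposed reader semantics can be sketched in plain Python. This is an illustration of the truncation rule only, not the pyarrow or Arrow C++ API; the function and names are hypothetical:

```python
# Sketch of the proposed semantics: a record batch may declare a length
# smaller than its arrays' lengths, and readers expose only that prefix.
def read_columns(batch_length, arrays):
    """Return each column truncated to the declared batch length."""
    for values in arrays:
        if batch_length > len(values):
            raise ValueError("RecordBatch.length exceeds array length")
    return [values[:batch_length] for values in arrays]

# A partially populated batch: arrays hold 3 slots, but only 1 row is valid.
columns = read_columns(1, [[10, 20, 30], ["a", "b", "c"]])
print(columns)  # [[10], ['a']]
```

The attached test file corresponds to the second case here: batch length 1 over arrays of length 3.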



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-5927) [Go] Test fuzzer inputs

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5927:

Fix Version/s: (was: 0.16.0)
   1.0.0

> [Go] Test fuzzer inputs
> ---
>
> Key: ARROW-5927
> URL: https://issues.apache.org/jira/browse/ARROW-5927
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Go
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> We are developing a fuzzer-based corpus of malformed IPC inputs
> https://github.com/apache/arrow-testing/tree/master/data/arrow-ipc
> The Go implementation should also test against these to verify that the 
> correct kind of exception is raised



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-5926) [Java] Test fuzzer inputs

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5926:

Fix Version/s: (was: 0.16.0)
   1.0.0

> [Java] Test fuzzer inputs
> -
>
> Key: ARROW-5926
> URL: https://issues.apache.org/jira/browse/ARROW-5926
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> We are developing a fuzzer-based corpus of malformed IPC inputs
> https://github.com/apache/arrow-testing/tree/master/data/arrow-ipc
> The Java implementation should also test against these to verify that the 
> correct kind of exception is raised



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-5890) [C++][Python] Support ExtensionType arrays in more kernels

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5890:

Fix Version/s: (was: 0.16.0)

> [C++][Python] Support ExtensionType arrays in more kernels
> --
>
> Key: ARROW-5890
> URL: https://issues.apache.org/jira/browse/ARROW-5890
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Joris Van den Bossche
>Priority: Major
>
> From a quick test (through Python), it seems that {{slice}} and {{take}} 
> work, but the following do not:
> - {{cast}}: it could rely on the casting rules for the storage type. Or do we 
> want to require explicitly taking the storage array before casting?
> - {{dictionary_encode}} / {{unique}}
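The "cast via storage" option can be sketched in plain Python. The classes here are hypothetical stand-ins, not the pyarrow API: the point is only that an extension array delegates casting to its underlying storage values.

```python
# Hypothetical sketch: an extension array that reuses its storage's cast rules.
class ExtensionArray:
    def __init__(self, storage):
        self.storage = storage  # plain list standing in for the storage array

    def cast(self, target):
        # Delegate to the storage type's casting rules rather than
        # requiring the caller to unwrap the storage array first.
        return [target(v) for v in self.storage]

arr = ExtensionArray([1, 2, 3])
print(arr.cast(float))  # [1.0, 2.0, 3.0]
```

The open question in the issue is whether this delegation should be implicit (as above) or whether users should have to take `.storage` explicitly before casting.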



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-5912) [Python] conversion from datetime objects with mixed timezones should normalize to UTC

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5912:

Fix Version/s: (was: 0.16.0)

> [Python] conversion from datetime objects with mixed timezones should 
> normalize to UTC
> --
>
> Key: ARROW-5912
> URL: https://issues.apache.org/jira/browse/ARROW-5912
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: beginner
>
> Currently, when given datetime objects with mixed timezones, each is 
> interpreted separately as its local time:
> {code:python}
> >>> ts_pd_paris = pd.Timestamp("1970-01-01 01:00", tz="Europe/Paris")
> >>> ts_pd_paris
> Timestamp('1970-01-01 01:00:00+0100', tz='Europe/Paris')
> >>> ts_pd_helsinki = pd.Timestamp("1970-01-01 02:00", tz="Europe/Helsinki")
> >>> ts_pd_helsinki
> Timestamp('1970-01-01 02:00:00+0200', tz='Europe/Helsinki')
> >>> a = pa.array([ts_pd_paris, ts_pd_helsinki])
> >>> a
> 
> [
>   1970-01-01 01:00:00.00,
>   1970-01-01 02:00:00.00
> ]
> >>> a.type
> TimestampType(timestamp[us])
> {code}
> So both timestamps actually refer to the same moment in time (the same value 
> in UTC; in pandas their stored {{value}} is also the same), but once converted 
> to pyarrow they become tz-naive and no longer denote the same time. That seems 
> rather unexpected and a source of bugs.
> I think a better option would be to normalize to UTC, and result in a 
> tz-aware TimestampArray with UTC as timezone. 
> That is also the behaviour of pandas if you force the conversion to result in 
> datetimes (by default pandas will keep them as an object array, preserving the 
> different timezones).
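The problem and the proposed fix can be illustrated with the standard library alone (fixed UTC offsets stand in for Europe/Paris and Europe/Helsinki on 1970-01-01; this is not the pyarrow conversion code itself):

```python
from datetime import datetime, timedelta, timezone

# Two wall-clock times in different timezones that denote the same instant.
paris = datetime(1970, 1, 1, 1, 0, tzinfo=timezone(timedelta(hours=1)))
helsinki = datetime(1970, 1, 1, 2, 0, tzinfo=timezone(timedelta(hours=2)))

# Dropping the tzinfo (what the current conversion effectively does) makes
# two equal instants look different:
assert paris.replace(tzinfo=None) != helsinki.replace(tzinfo=None)

# Normalizing to UTC first (the proposed behaviour) keeps them equal:
paris_utc = paris.astimezone(timezone.utc)
helsinki_utc = helsinki.astimezone(timezone.utc)
assert paris_utc == helsinki_utc == datetime(1970, 1, 1, tzinfo=timezone.utc)
```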



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-5879) [C++][Python] Clean up linking of optional libraries within C++ and to Python extensions

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-5879.
---
Fix Version/s: (was: 0.16.0)
   Resolution: Duplicate

This was done in 
https://github.com/apache/arrow/commit/102acc47287c37a01ac11a5cb6bd1da3f1f0712d#diff-79b695dff65b8b0a69bfed14e824cb18

> [C++][Python] Clean up linking of optional libraries within C++ and to Python 
> extensions
> 
>
> Key: ARROW-5879
> URL: https://issues.apache.org/jira/browse/ARROW-5879
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Wes McKinney
>Priority: Major
>
> Optional modules such as
> * Flight (and its dependents, including OpenSSL)
> * Parquet
> * Gandiva
> are all linked unconditionally to {{pyarrow.lib}}. It would be better IMHO to 
> only link these libraries to the corresponding Cython extension rather than 
> link everything to every extension.
> Relatedly, libraries like OpenSSL are being included in linking with all 
> shared libraries. We should clean this up to only link to the relevant shared 
> libraries where it is required, like {{libparquet}} (for encryption support) 
> and {{libarrow_flight}} (for using gRPC with TLS)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-5845) [Java] Implement converter between Arrow record batches and Avro records

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5845:

Fix Version/s: (was: 0.16.0)
   1.0.0

> [Java] Implement converter between Arrow record batches and Avro records
> 
>
> Key: ARROW-5845
> URL: https://issues.apache.org/jira/browse/ARROW-5845
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Major
> Fix For: 1.0.0
>
>
> It would be useful for applications which need to convert Avro data to Arrow 
> data.
> This is an adapter which converts data through an existing API (like the JDBC 
> adapter) rather than a native reader (like ORC).
> We implement this function through the Avro Java project, receiving params such 
> as Avro's Decoder/Schema/DatumReader and returning a VectorSchemaRoot. For each 
> data type we have a consumer class, as below, that reads Avro data and writes 
> it into a vector to avoid boxing/unboxing (e.g. GenericRecord#get returns 
> Object):
> {code:java}
> public class AvroIntConsumer implements Consumer {
>
>   private final IntWriter writer;
>
>   public AvroIntConsumer(IntVector vector) {
>     this.writer = new IntWriterImpl(vector);
>   }
>
>   @Override
>   public void consume(Decoder decoder) throws IOException {
>     writer.writeInt(decoder.readInt());
>     writer.setPosition(writer.getPosition() + 1);
>   }
> }
> {code}
> We intend to support primitive and complex types (null values represented via 
> a union type containing a null type); a size limit and field selection could 
> be optional for users.
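The per-type consumer pattern described above can be sketched language-neutrally; the following is a Python illustration only (the real implementation is Java, and these class names are hypothetical stand-ins for Avro's Decoder and Arrow's vectors):

```python
# Sketch of the consumer pattern: one consumer per data type pulls typed
# values straight from a decoder into a vector, avoiding boxing each value
# as a generic Object.
class IntConsumer:
    """Reads one int per call from a decoder and appends it to a vector."""
    def __init__(self, vector):
        self.vector = vector

    def consume(self, decoder):
        self.vector.append(decoder.read_int())


class ListDecoder:
    """Stand-in for Avro's Decoder, yielding pre-parsed ints."""
    def __init__(self, values):
        self._values = iter(values)

    def read_int(self):
        return next(self._values)


vector = []
consumer = IntConsumer(vector)
decoder = ListDecoder([1, 2, 3])
for _ in range(3):   # one consume() call per incoming record
    consumer.consume(decoder)
print(vector)  # [1, 2, 3]
```

In the Java adapter, a schema-driven dispatch would pick the matching consumer (int, string, list, ...) for each field.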



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-5858) [Doc] Better document the Tensor classes in the prose documentation

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5858:

Fix Version/s: (was: 0.16.0)

> [Doc] Better document the Tensor classes in the prose documentation
> ---
>
> Key: ARROW-5858
> URL: https://issues.apache.org/jira/browse/ARROW-5858
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Documentation, Python
>Reporter: Joris Van den Bossche
>Priority: Major
>
> From a comment from [~wesmckinn] in ARROW-2714:
> {quote}The Tensor classes are independent from the columnar data structures, 
> though they reuse pieces of metadata, metadata serialization, memory 
> management, and IPC.
> The purpose of adding these to the library was to have in-memory data 
> structures for handling Tensor/ndarray data and metadata that "plug in" to 
> the rest of the Arrow C++ system (Plasma store, IO subsystem, memory pools, 
> buffers, etc.).
> Theoretically you could return a Tensor when creating a non-contiguous slice 
> of an Array; in light of the above, I don't think that would be intuitive.
> When we started the project, our focus was creating an open standard for 
> in-memory columnar data, a hitherto unsolved problem. The project's scope has 
> expanded into peripheral problems in the same domain in the meantime (with 
> the mantra of creating interoperable components, a use-what-you-need 
> development platform for system developers). I think this aspect of the 
> project could be better documented / advertised, since the project's initial 
> focus on the columnar standard has given some the mistaken impression that we 
> are not interested in any work outside of that.
> {quote}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-5764) [Java] Failed to build document with OpenJDK 11

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5764:

Fix Version/s: (was: 0.16.0)

> [Java] Failed to build document with OpenJDK 11
> ---
>
> Key: ARROW-5764
> URL: https://issues.apache.org/jira/browse/ARROW-5764
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Kouhei Sutou
>Priority: Major
>
> It reports the following error:
> {noformat}
> [ERROR] Exit code: 1 - javadoc: error - The code being documented uses 
> modules but the packages defined in http://docs.oracle.com/javase/8/docs/api/ 
> are in the unnamed module.
> {noformat}
> See also: https://travis-ci.org/kou/arrow/jobs/551254733#L1453
> This branch just enables Javadoc with OpenJDK 11: 
> https://github.com/kou/arrow/commit/1eeded4b9d18d474721733751f57392cee766004.diff
> {noformat}
> diff --git a/.travis.yml b/.travis.yml
> index 5dc901561e8..1d6ba86dc2d 100644
> --- a/.travis.yml
> +++ b/.travis.yml
> @@ -225,6 +225,7 @@ matrix:
>  - if [ $ARROW_CI_JAVA_AFFECTED != "1" ]; then exit; fi
>  script:
>  - $TRAVIS_BUILD_DIR/ci/travis_script_java.sh
> +- $TRAVIS_BUILD_DIR/ci/travis_script_javadoc.sh
>- name: "Integration w/ OpenJDK 8, conda-forge toolchain"
>  language: java
>  os: linux
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

