[jira] [Closed] (ARROW-4914) [Rust] Array slice returns incorrect bitmask

2019-09-14 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale closed ARROW-4914.
-
Resolution: Resolved

> [Rust] Array slice returns incorrect bitmask
> 
>
> Key: ARROW-4914
> URL: https://issues.apache.org/jira/browse/ARROW-4914
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Affects Versions: 0.13.0
>Reporter: Neville Dipale
>Priority: Blocker
>  Labels: beginner
>
> Slicing arrays changes the offset, length and null count of their array data, 
> but the bitmask is not changed.
> This results in the correct null count, but the array values might be marked 
> incorrectly as valid/invalid based on the old bitmask positions before the 
> offset.
> To reproduce, create an array with some null values, slice the array, and 
> then dbg!() it (after downcasting).
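
For reference, a minimal pyarrow sketch (an editorial illustration, not the Rust reproduction) of the slicing semantics the Rust implementation is expected to match: slicing only changes offset, length and null count, while the validity bitmap buffer is shared, so it must be read relative to the offset.

{code:python}
import pyarrow as pa

arr = pa.array([1, None, 3, None, 5])
sliced = arr.slice(2)          # shares the same buffers; only metadata changes

print(sliced.offset)           # 2 -> readers must skip 2 bits of the validity bitmap
print(sliced.null_count)       # 1
print(sliced.to_pylist())      # [3, None, 5]
{code}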



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-3543) [R] Better support for timestamp format and time zones in R

2019-09-14 Thread Olaf (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16929846#comment-16929846
 ] 

Olaf commented on ARROW-3543:
-

OP here. So we're finally close to having this fixed? Amazing!

> [R] Better support for timestamp format and time zones in R
> ---
>
> Key: ARROW-3543
> URL: https://issues.apache.org/jira/browse/ARROW-3543
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Olaf
>Priority: Major
> Fix For: 1.0.0
>
>
> See below for original description and reports. In sum, there is a mismatch 
> between how the C++ library and R interpret data without a timezone, and it 
> turns out that we're not passing the timezone to R if it is set in Arrow C++ 
> anyway. 
> The [C++ library 
> docs|http://arrow.apache.org/docs/cpp/api/datatype.html#_CPPv4N5arrow13TimestampTypeE]
>  say "If a timezone-aware field contains a recognized timezone, its values 
> may be localized to that locale upon display; the values of timezone-naive 
> fields must always be displayed “as is”, with no localization performed on 
> them." But R's print default, as well as the parsing default, is the current 
> time zone: 
> https://stat.ethz.ch/R-manual/R-devel/library/base/html/strptime.html
> The C++ library seems to parse timestamp strings that don't have timezone 
> information as if they are UTC, so when you read timezone-naive timestamps 
> from Arrow and print them in R, they are shifted to be localized to the 
> current timezone. If you print timestamp data from Arrow with 
> {{print(timestamp_var, tz="GMT")}} it would look as you expect.
> On further inspection, the [arrow-to-vector code for 
> timestamp|https://github.com/apache/arrow/blob/master/r/src/array_to_vector.cpp#L504-L514]
>  doesn't seem to consider time zone information even if it does exist. So we 
> don't have the means currently in R to display timestamp data faithfully, 
> whether or not it is timezone-aware.
> Among the tasks here:
> * Include the timezone attribute in the POSIXct R vector that gets created 
> from a timestamp Arrow array
> * Ensure that timezone-naive data from Arrow is printed in R "as is" with no 
> localization 
> -
> Original description:
> Hello the dream team,
> Pasting from [https://github.com/wesm/feather/issues/351]
> Thanks for this wonderful package. I was playing with feather and some 
> timestamps and I noticed some dangerous behavior. Maybe it is a bug.
> Consider this
>  
> {code:java}
> import pandas as pd
> import feather
> import numpy as np
> df = pd.DataFrame(
> {'string_time_utc' : [pd.to_datetime('2018-02-01 14:00:00.531'), 
> pd.to_datetime('2018-02-01 14:01:00.456'), pd.to_datetime('2018-03-05 
> 14:01:02.200')]}
> )
> df['timestamp_est'] = 
> pd.to_datetime(df.string_time_utc).dt.tz_localize('UTC').dt.tz_convert('US/Eastern').dt.tz_localize(None)
> df
>  Out[17]: 
>  string_time_utc timestamp_est
>  0 2018-02-01 14:00:00.531 2018-02-01 09:00:00.531
>  1 2018-02-01 14:01:00.456 2018-02-01 09:01:00.456
>  2 2018-03-05 14:01:02.200 2018-03-05 09:01:02.200
> {code}
> Here I create the corresponding `EST` timestamp of my original timestamps (in 
> `UTC` time).
> Now saving the dataframe to `csv` or to `feather` will generate two 
> completely different results.
>  
> {code:java}
> df.to_csv('P://testing.csv')
> df.to_feather('P://testing.feather')
> {code}
> Switching to R.
> Using the good old `csv` gives me something a bit annoying, but expected. R 
> thinks my timezone is `UTC` by default, and wrongly attached this timezone to 
> `timestamp_est`. No big deal, I can always use `with_tz` or even better: 
> import as character and process as timestamp while in R.
>  
> {code:java}
> > dataframe <- read_csv('P://testing.csv')
>  Parsed with column specification:
>  cols(
>  X1 = col_integer(),
>  string_time_utc = col_datetime(format = ""),
>  timestamp_est = col_datetime(format = "")
>  )
>  Warning message:
>  Missing column names filled in: 'X1' [1] 
>  > 
>  > dataframe %>% mutate(mytimezone = tz(timestamp_est))
> # A tibble: 3 x 4
>      X1 string_time_utc         timestamp_est           mytimezone
>   <int> <dttm>                  <dttm>                  <chr>
> 1     0 2018-02-01 14:00:00.530 2018-02-01 09:00:00.530 UTC
> 2     1 2018-02-01 14:01:00.456 2018-02-01 09:01:00.456 UTC
> 3     2 2018-03-05 14:01:02.200 2018-03-05 09:01:02.200 UTC
> {code}
> {code:java}
> #Now look at what happens with feather:
>  
>  > dataframe <- read_feather('P://testing.feather')
>  > 
>  > dataframe %>% mutate(mytimezone = tz(timestamp_est))
> # A tibble: 3 x 3
>   string_time_utc         timestamp_est           mytimezone
>   <dttm>                  <dttm>                  <chr>
> 1 2018-02-01 09:00:00.531 2018-02-01 04:00:00.531 ""
> 2 2018-02-01 09:01:00.456 2018-02-01 04:01:00.456 ""
> 3 2018-03-05 09:01:02.200 2018-03-05 04:01:02.200 ""
> {code}
> My timestamps have been converted!!! pure insanity. 
>  Am I missing something here?
> Thanks!!

[jira] [Updated] (ARROW-6559) [Developer][C++] Add "archery" option to specify system toolchain for C++ builds

2019-09-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6559:
--
Labels: pull-request-available  (was: )

> [Developer][C++] Add "archery" option to specify system toolchain for C++ 
> builds
> 
>
> Key: ARROW-6559
> URL: https://issues.apache.org/jira/browse/ARROW-6559
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Developer Tools
>Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>
> I use toolchain directories that are found outside of conda environments. 
> It's a bit awkward to use archery to do benchmark comparisons with this 
> arrangement. I suggest adding a "--cpp-package-prefix" option or similar that 
> will set {{ARROW_DEPENDENCY_SOURCE=SYSTEM}} and the correct 
> {{ARROW_PACKAGE_PREFIX}} so this works properly
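
A rough sketch of how such a flag could expand into the CMake definitions named above; the option and command names here are hypothetical and this is not the actual archery code.

{code:python}
import click


@click.command()
@click.option("--cpp-package-prefix", default=None,
              help="Root of a system toolchain providing the C++ dependencies.")
def build(cpp_package_prefix):
    cmake_defs = {}
    if cpp_package_prefix:
        # Prefer system-provided packages over conda/bundled toolchains.
        cmake_defs["ARROW_DEPENDENCY_SOURCE"] = "SYSTEM"
        cmake_defs["ARROW_PACKAGE_PREFIX"] = cpp_package_prefix
    click.echo("cmake " + " ".join(f"-D{k}={v}" for k, v in cmake_defs.items()))


if __name__ == "__main__":
    build()
{code}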



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Resolved] (ARROW-1741) [C++] Comparison function for DictionaryArray to determine if indices are "compatible"

2019-09-14 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-1741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-1741.
-
Resolution: Fixed

Issue resolved by pull request 5342
[https://github.com/apache/arrow/pull/5342]

> [C++] Comparison function for DictionaryArray to determine if indices are 
> "compatible"
> --
>
> Key: ARROW-1741
> URL: https://issues.apache.org/jira/browse/ARROW-1741
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Benjamin Kietzman
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> For example, if one array's dictionary is larger than the other, but the 
> overlapping beginning portion is the same, then the respective dictionary 
> indices correspond to the same values. Therefore, in analytics, one may 
> choose to drop the smaller dictionary in favor of the larger dictionary, and 
> this need not incur any computational overhead (beyond comparing the 
> dictionary prefixes -- there may be some way to engineer "dictionary lineage" 
> to make this comparison even cheaper)
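
To make the idea concrete, a small pyarrow sketch of the prefix check described above (the helper name is made up):

{code:python}
# If the smaller dictionary is a prefix of the larger one, the same indices
# refer to the same values, so the smaller dictionary can be dropped in
# favor of the larger one without remapping indices.
import pyarrow as pa

small = pa.array(["a", "b"])
large = pa.array(["a", "b", "c"])      # same prefix, one extra entry

x = pa.DictionaryArray.from_arrays(pa.array([0, 1, 0]), small)
y = pa.DictionaryArray.from_arrays(pa.array([1, 2, 0]), large)


def dictionaries_compatible(lhs, rhs):
    """Hypothetical helper: True if one dictionary is a prefix of the other."""
    shorter, longer = sorted([lhs.dictionary, rhs.dictionary], key=len)
    return longer.slice(0, len(shorter)).equals(shorter)


print(dictionaries_compatible(x, y))   # True -> indices need no remapping
{code}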



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-3543) [R] Better support for timestamp format and time zones in R

2019-09-14 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-3543:
---
Description: 
See below for original description and reports. In sum, there is a mismatch 
between how the C++ library and R interpret data without a timezone, and it 
turns out that we're not passing the timezone to R if it is set in Arrow C++ 
anyway. 

The [C++ library 
docs|http://arrow.apache.org/docs/cpp/api/datatype.html#_CPPv4N5arrow13TimestampTypeE]
 say "If a timezone-aware field contains a recognized timezone, its values may 
be localized to that locale upon display; the values of timezone-naive fields 
must always be displayed “as is”, with no localization performed on them." But 
R's print default, as well as the parsing default, is the current time zone: 
https://stat.ethz.ch/R-manual/R-devel/library/base/html/strptime.html

The C++ library seems to parse timestamp strings that don't have timezone 
information as if they are UTC, so when you read timezone-naive timestamps from 
Arrow and print them in R, they are shifted to be localized to the current 
timezone. If you print timestamp data from Arrow with {{print(timestamp_var, 
tz="GMT")}} it would look as you expect.

On further inspection, the [arrow-to-vector code for 
timestamp|https://github.com/apache/arrow/blob/master/r/src/array_to_vector.cpp#L504-L514]
 doesn't seem to consider time zone information even if it does exist. So we 
don't have the means currently in R to display timestamp data faithfully, 
whether or not it is timezone-aware.

Among the tasks here:

* Include the timezone attribute in the POSIXct R vector that gets created from 
a timestamp Arrow array
* Ensure that timezone-naive data from Arrow is printed in R "as is" with no 
localization 

-
Original description:

Hello the dream team,

Pasting from [https://github.com/wesm/feather/issues/351]

Thanks for this wonderful package. I was playing with feather and some 
timestamps and I noticed some dangerous behavior. Maybe it is a bug.

Consider this

 
{code:java}
import pandas as pd
import feather
import numpy as np
df = pd.DataFrame(
{'string_time_utc' : [pd.to_datetime('2018-02-01 14:00:00.531'), 
pd.to_datetime('2018-02-01 14:01:00.456'), pd.to_datetime('2018-03-05 
14:01:02.200')]}
)
df['timestamp_est'] = 
pd.to_datetime(df.string_time_utc).dt.tz_localize('UTC').dt.tz_convert('US/Eastern').dt.tz_localize(None)
df
 Out[17]: 
 string_time_utc timestamp_est
 0 2018-02-01 14:00:00.531 2018-02-01 09:00:00.531
 1 2018-02-01 14:01:00.456 2018-02-01 09:01:00.456
 2 2018-03-05 14:01:02.200 2018-03-05 09:01:02.200
{code}
Here I create the corresponding `EST` timestamp of my original timestamps (in 
`UTC` time).

Now saving the dataframe to `csv` or to `feather` will generate two completely 
different results.

 
{code:java}
df.to_csv('P://testing.csv')
df.to_feather('P://testing.feather')
{code}
Switching to R.

Using the good old `csv` gives me something a bit annoying, but expected. R 
thinks my timezone is `UTC` by default, and wrongly attached this timezone to 
`timestamp_est`. No big deal, I can always use `with_tz` or even better: import 
as character and process as timestamp while in R.

 
{code:java}
> dataframe <- read_csv('P://testing.csv')
 Parsed with column specification:
 cols(
 X1 = col_integer(),
 string_time_utc = col_datetime(format = ""),
 timestamp_est = col_datetime(format = "")
 )
 Warning message:
 Missing column names filled in: 'X1' [1] 
 > 
 > dataframe %>% mutate(mytimezone = tz(timestamp_est))

# A tibble: 3 x 4
     X1 string_time_utc         timestamp_est           mytimezone
  <int> <dttm>                  <dttm>                  <chr>
1     0 2018-02-01 14:00:00.530 2018-02-01 09:00:00.530 UTC
2     1 2018-02-01 14:01:00.456 2018-02-01 09:01:00.456 UTC
3     2 2018-03-05 14:01:02.200 2018-03-05 09:01:02.200 UTC
{code}
{code:java}
#Now look at what happens with feather:
 
 > dataframe <- read_feather('P://testing.feather')
 > 
 > dataframe %>% mutate(mytimezone = tz(timestamp_est))

# A tibble: 3 x 3
  string_time_utc         timestamp_est           mytimezone
  <dttm>                  <dttm>                  <chr>
1 2018-02-01 09:00:00.531 2018-02-01 04:00:00.531 ""
2 2018-02-01 09:01:00.456 2018-02-01 04:01:00.456 ""
3 2018-03-05 09:01:02.200 2018-03-05 04:01:02.200 ""
{code}
My timestamps have been converted!!! pure insanity. 
 Am I missing something here?

Thanks!!

  was:
See below for original description and reports. In sum, there is a mismatch 
between how the C++ library and R interpret data without a timezone, and it 
turns out that we're not passing the timezone to R if it is set in Arrow C++ 
anyway. 

The C++ library docs say "If a timezone-aware field contains a recognized 
timezone, its values may be localized to that locale upon display; the values 
of timezone-naive fields must always be displayed “as is”, with no localization 
performed on them." But R's print default is the current time zone: 

[jira] [Updated] (ARROW-3543) [R] Better support for timestamp format and time zones in R

2019-09-14 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-3543:
---
Description: 
See below for original description and reports. In sum, there is a mismatch 
between how the C++ library and R interpret data without a timezone, and it 
turns out that we're not passing the timezone to R if it is set in Arrow C++ 
anyway. 

The C++ library docs say "If a timezone-aware field contains a recognized 
timezone, its values may be localized to that locale upon display; the values 
of timezone-naive fields must always be displayed “as is”, with no localization 
performed on them." But R's print default is the current time zone: 
https://stat.ethz.ch/R-manual/R-devel/library/base/html/strptime.html

My guess is that readr::read_delim interprets timestamps without a timezone to 
be the current time zone, but arrow C++ interprets that as UTC, which becomes a 
problem when R tries to print the timestamp.

I'm guessing that if you did print(df$Date, tz="GMT") it would look as you 
expect.

Other fun fact I saw while digging in: the arrow-to-vector code for timestamp 
doesn't seem to consider time zone information if it does exist, so we should 
handle that too. 
https://github.com/apache/arrow/blob/master/r/src/array_to_vector.cpp#L504-L514

-
Original description:

Hello the dream team,

Pasting from [https://github.com/wesm/feather/issues/351]

Thanks for this wonderful package. I was playing with feather and some 
timestamps and I noticed some dangerous behavior. Maybe it is a bug.

Consider this

 
{code:java}
import pandas as pd
import feather
import numpy as np
df = pd.DataFrame(
{'string_time_utc' : [pd.to_datetime('2018-02-01 14:00:00.531'), 
pd.to_datetime('2018-02-01 14:01:00.456'), pd.to_datetime('2018-03-05 
14:01:02.200')]}
)
df['timestamp_est'] = 
pd.to_datetime(df.string_time_utc).dt.tz_localize('UTC').dt.tz_convert('US/Eastern').dt.tz_localize(None)
df
 Out[17]: 
 string_time_utc timestamp_est
 0 2018-02-01 14:00:00.531 2018-02-01 09:00:00.531
 1 2018-02-01 14:01:00.456 2018-02-01 09:01:00.456
 2 2018-03-05 14:01:02.200 2018-03-05 09:01:02.200
{code}
Here I create the corresponding `EST` timestamp of my original timestamps (in 
`UTC` time).

Now saving the dataframe to `csv` or to `feather` will generate two completely 
different results.

 
{code:java}
df.to_csv('P://testing.csv')
df.to_feather('P://testing.feather')
{code}
Switching to R.

Using the good old `csv` gives me something a bit annoying, but expected. R 
thinks my timezone is `UTC` by default, and wrongly attached this timezone to 
`timestamp_est`. No big deal, I can always use `with_tz` or even better: import 
as character and process as timestamp while in R.

 
{code:java}
> dataframe <- read_csv('P://testing.csv')
 Parsed with column specification:
 cols(
 X1 = col_integer(),
 string_time_utc = col_datetime(format = ""),
 timestamp_est = col_datetime(format = "")
 )
 Warning message:
 Missing column names filled in: 'X1' [1] 
 > 
 > dataframe %>% mutate(mytimezone = tz(timestamp_est))

# A tibble: 3 x 4
     X1 string_time_utc         timestamp_est           mytimezone
  <int> <dttm>                  <dttm>                  <chr>
1     0 2018-02-01 14:00:00.530 2018-02-01 09:00:00.530 UTC
2     1 2018-02-01 14:01:00.456 2018-02-01 09:01:00.456 UTC
3     2 2018-03-05 14:01:02.200 2018-03-05 09:01:02.200 UTC
{code}
{code:java}
#Now look at what happens with feather:
 
 > dataframe <- read_feather('P://testing.feather')
 > 
 > dataframe %>% mutate(mytimezone = tz(timestamp_est))

# A tibble: 3 x 3
  string_time_utc         timestamp_est           mytimezone
  <dttm>                  <dttm>                  <chr>
1 2018-02-01 09:00:00.531 2018-02-01 04:00:00.531 ""
2 2018-02-01 09:01:00.456 2018-02-01 04:01:00.456 ""
3 2018-03-05 09:01:02.200 2018-03-05 04:01:02.200 ""
{code}
My timestamps have been converted!!! pure insanity. 
 Am I missing something here?

Thanks!!

  was:
Hello the dream team,

Pasting from [https://github.com/wesm/feather/issues/351]

Thanks for this wonderful package. I was playing with feather and some 
timestamps and I noticed some dangerous behavior. Maybe it is a bug.

Consider this

 
{code:java}
import pandas as pd
import feather
import numpy as np
df = pd.DataFrame(
{'string_time_utc' : [pd.to_datetime('2018-02-01 14:00:00.531'), 
pd.to_datetime('2018-02-01 14:01:00.456'), pd.to_datetime('2018-03-05 
14:01:02.200')]}
)
df['timestamp_est'] = 
pd.to_datetime(df.string_time_utc).dt.tz_localize('UTC').dt.tz_convert('US/Eastern').dt.tz_localize(None)
df
 Out[17]: 
 string_time_utc timestamp_est
 0 2018-02-01 14:00:00.531 2018-02-01 09:00:00.531
 1 2018-02-01 14:01:00.456 2018-02-01 09:01:00.456
 2 2018-03-05 14:01:02.200 2018-03-05 09:01:02.200
{code}
Here I create the corresponding `EST` timestamp of my original timestamps (in 
`UTC` time).

Now saving the dataframe to `csv` or to `feather` will generate two completely 
different results.

 
{code:java}
df.to_csv('P://testing.csv')

[jira] [Updated] (ARROW-3543) [R] Better support for timestamp format and time zones in R

2019-09-14 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-3543:
---
Summary: [R] Better support for timestamp format and time zones in R  (was: 
[R] Time zone adjustment issue when reading Feather file written by Python)

> [R] Better support for timestamp format and time zones in R
> ---
>
> Key: ARROW-3543
> URL: https://issues.apache.org/jira/browse/ARROW-3543
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Olaf
>Priority: Major
> Fix For: 1.0.0
>
>
> Hello the dream team,
> Pasting from [https://github.com/wesm/feather/issues/351]
> Thanks for this wonderful package. I was playing with feather and some 
> timestamps and I noticed some dangerous behavior. Maybe it is a bug.
> Consider this
>  
> {code:java}
> import pandas as pd
> import feather
> import numpy as np
> df = pd.DataFrame(
> {'string_time_utc' : [pd.to_datetime('2018-02-01 14:00:00.531'), 
> pd.to_datetime('2018-02-01 14:01:00.456'), pd.to_datetime('2018-03-05 
> 14:01:02.200')]}
> )
> df['timestamp_est'] = 
> pd.to_datetime(df.string_time_utc).dt.tz_localize('UTC').dt.tz_convert('US/Eastern').dt.tz_localize(None)
> df
>  Out[17]: 
>  string_time_utc timestamp_est
>  0 2018-02-01 14:00:00.531 2018-02-01 09:00:00.531
>  1 2018-02-01 14:01:00.456 2018-02-01 09:01:00.456
>  2 2018-03-05 14:01:02.200 2018-03-05 09:01:02.200
> {code}
> Here I create the corresponding `EST` timestamp of my original timestamps (in 
> `UTC` time).
> Now saving the dataframe to `csv` or to `feather` will generate two 
> completely different results.
>  
> {code:java}
> df.to_csv('P://testing.csv')
> df.to_feather('P://testing.feather')
> {code}
> Switching to R.
> Using the good old `csv` gives me something a bit annoying, but expected. R 
> thinks my timezone is `UTC` by default, and wrongly attached this timezone to 
> `timestamp_est`. No big deal, I can always use `with_tz` or even better: 
> import as character and process as timestamp while in R.
>  
> {code:java}
> > dataframe <- read_csv('P://testing.csv')
>  Parsed with column specification:
>  cols(
>  X1 = col_integer(),
>  string_time_utc = col_datetime(format = ""),
>  timestamp_est = col_datetime(format = "")
>  )
>  Warning message:
>  Missing column names filled in: 'X1' [1] 
>  > 
>  > dataframe %>% mutate(mytimezone = tz(timestamp_est))
> # A tibble: 3 x 4
>      X1 string_time_utc         timestamp_est           mytimezone
>   <int> <dttm>                  <dttm>                  <chr>
> 1     0 2018-02-01 14:00:00.530 2018-02-01 09:00:00.530 UTC
> 2     1 2018-02-01 14:01:00.456 2018-02-01 09:01:00.456 UTC
> 3     2 2018-03-05 14:01:02.200 2018-03-05 09:01:02.200 UTC
> {code}
> {code:java}
> #Now look at what happens with feather:
>  
>  > dataframe <- read_feather('P://testing.feather')
>  > 
>  > dataframe %>% mutate(mytimezone = tz(timestamp_est))
> # A tibble: 3 x 3
>   string_time_utc         timestamp_est           mytimezone
>   <dttm>                  <dttm>                  <chr>
> 1 2018-02-01 09:00:00.531 2018-02-01 04:00:00.531 ""
> 2 2018-02-01 09:01:00.456 2018-02-01 04:01:00.456 ""
> 3 2018-03-05 09:01:02.200 2018-03-05 04:01:02.200 ""
> {code}
> My timestamps have been converted!!! pure insanity. 
>  Am I missing something here?
> Thanks!!



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-4208) [CI/Python] Have automatized tests for S3

2019-09-14 Thread Rok Mihevc (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16929826#comment-16929826
 ] 

Rok Mihevc commented on ARROW-4208:
---

With [#5200|https://github.com/apache/arrow/pull/5200] we get a minio server in 
pytest via a fixture. It will run in the docker images, Travis and Appveyor.
So far we have one 
[test|https://github.com/apache/arrow/blob/59f1e148d5c0fa13b7964f85f13011532ff515ed/python/pyarrow/tests/test_parquet.py#L1797]
 in Python.

Do we want to add other tests now? Do we have regression examples?
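
For anyone curious, a rough sketch (not the fixture from the PR; the port, credentials and startup wait are placeholders) of how a pytest fixture can expose a local minio server for S3 tests:

{code:python}
import os
import subprocess
import time

import pytest


@pytest.fixture(scope="session")
def minio_server(tmp_path_factory):
    """Start a throwaway minio server and yield its endpoint and credentials."""
    data_dir = tmp_path_factory.mktemp("minio-data")
    env = dict(os.environ,
               MINIO_ACCESS_KEY="arrow",
               MINIO_SECRET_KEY="apachearrow")
    proc = subprocess.Popen(
        ["minio", "server", "--address", "127.0.0.1:9000", str(data_dir)],
        env=env)
    time.sleep(1)  # crude startup wait; a real fixture would poll the port
    try:
        yield "127.0.0.1:9000", "arrow", "apachearrow"
    finally:
        proc.terminate()
        proc.wait()
{code}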

> [CI/Python] Have automatized tests for S3
> -
>
> Key: ARROW-4208
> URL: https://issues.apache.org/jira/browse/ARROW-4208
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, Python
>Reporter: Krisztian Szucs
>Assignee: Rok Mihevc
>Priority: Major
>  Labels: filesystem, pull-request-available, s3
> Fix For: 1.0.0
>
>
> Currently we don't run S3 integration tests regularly. 
> Possible solutions:
> - mock it within python/pytest
> - simply run the s3 tests with an S3 credential provided
> - create a hdfs-integration like docker-compose setup and run an S3 mock 
> server (e.g.: https://github.com/adobe/S3Mock, 
> https://github.com/jubos/fake-s3, https://github.com/gaul/s3proxy, 
> https://github.com/jserver/mock-s3)
> For more see discussion https://github.com/apache/arrow/pull/3286



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-6562) [GLib] Fix wrong sliced data of GArrowBuffer

2019-09-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6562:
--
Labels: pull-request-available  (was: )

> [GLib] Fix wrong sliced data of GArrowBuffer
> 
>
> Key: ARROW-6562
> URL: https://issues.apache.org/jira/browse/ARROW-6562
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Sutou Kouhei
>Assignee: Sutou Kouhei
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6562) [GLib] Fix wrong sliced data of GArrowBuffer

2019-09-14 Thread Sutou Kouhei (Jira)
Sutou Kouhei created ARROW-6562:
---

 Summary: [GLib] Fix wrong sliced data of GArrowBuffer
 Key: ARROW-6562
 URL: https://issues.apache.org/jira/browse/ARROW-6562
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Sutou Kouhei
Assignee: Sutou Kouhei






--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-4208) [CI/Python] Have automatized tests for S3

2019-09-14 Thread Rok Mihevc (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rok Mihevc updated ARROW-4208:
--
Labels: filesystem pull-request-available s3  (was: filesystem s3)

> [CI/Python] Have automatized tests for S3
> -
>
> Key: ARROW-4208
> URL: https://issues.apache.org/jira/browse/ARROW-4208
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, Python
>Reporter: Krisztian Szucs
>Assignee: Rok Mihevc
>Priority: Major
>  Labels: filesystem, pull-request-available, s3
> Fix For: 1.0.0
>
>
> Currently we don't run S3 integration tests regularly. 
> Possible solutions:
> - mock it within python/pytest
> - simply run the s3 tests with an S3 credential provided
> - create a hdfs-integration like docker-compose setup and run an S3 mock 
> server (e.g.: https://github.com/adobe/S3Mock, 
> https://github.com/jubos/fake-s3, https://github.com/gaul/s3proxy, 
> https://github.com/jserver/mock-s3)
> For more see discussion https://github.com/apache/arrow/pull/3286



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-6090) [Rust] [DataFusion] Implement parallel execution for hash aggregate

2019-09-14 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-6090:
--
Fix Version/s: 0.15.0

> [Rust] [DataFusion] Implement parallel execution for hash aggregate
> ---
>
> Key: ARROW-6090
> URL: https://issues.apache.org/jira/browse/ARROW-6090
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-3543) [R] Time zone adjustment issue when reading Feather file written by Python

2019-09-14 Thread Shannon C Lewis (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16929781#comment-16929781
 ] 

Shannon C Lewis commented on ARROW-3543:


Hey Neal,

Yep, when I run print(arrow_log$DateTime, tz="GMT") I'm getting the correct 
datetime:
> print(arrow_log$DateTime,tz="GMT")
 [1] "2019-09-11 21:36:22 GMT" "2019-09-11 21:36:22 GMT" "2019-09-11 22:43:58 
GMT" "2019-09-11 22:43:58 GMT" "2019-09-11 23:11:39 GMT" "2019-09-12 00:36:22 
GMT"
 [7] "2019-09-12 00:36:22 GMT" "2019-09-12 00:43:58 GMT" "2019-09-12 00:43:58 
GMT" "2019-09-12 01:11:39 GMT

I also tested this by saving to feather from Python and reading with both the 
feather and arrow packages:
> print(py_feather$DateTime,tz="GMT")
 [1] "2019-09-11 21:36:22 GMT" "2019-09-11 21:36:22 GMT" "2019-09-11 22:43:58 
GMT" "2019-09-11 22:43:58 GMT" "2019-09-11 23:11:39 GMT" "2019-09-12 00:36:22 
GMT"
 [7] "2019-09-12 00:36:22 GMT" "2019-09-12 00:43:58 GMT" "2019-09-12 00:43:58 
GMT" "2019-09-12 01:11:39 GMT"

So it seems like you have identified the issue :)


> [R] Time zone adjustment issue when reading Feather file written by Python
> --
>
> Key: ARROW-3543
> URL: https://issues.apache.org/jira/browse/ARROW-3543
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Olaf
>Priority: Major
> Fix For: 1.0.0
>
>
> Hello the dream team,
> Pasting from [https://github.com/wesm/feather/issues/351]
> Thanks for this wonderful package. I was playing with feather and some 
> timestamps and I noticed some dangerous behavior. Maybe it is a bug.
> Consider this
>  
> {code:java}
> import pandas as pd
> import feather
> import numpy as np
> df = pd.DataFrame(
> {'string_time_utc' : [pd.to_datetime('2018-02-01 14:00:00.531'), 
> pd.to_datetime('2018-02-01 14:01:00.456'), pd.to_datetime('2018-03-05 
> 14:01:02.200')]}
> )
> df['timestamp_est'] = 
> pd.to_datetime(df.string_time_utc).dt.tz_localize('UTC').dt.tz_convert('US/Eastern').dt.tz_localize(None)
> df
>  Out[17]: 
>  string_time_utc timestamp_est
>  0 2018-02-01 14:00:00.531 2018-02-01 09:00:00.531
>  1 2018-02-01 14:01:00.456 2018-02-01 09:01:00.456
>  2 2018-03-05 14:01:02.200 2018-03-05 09:01:02.200
> {code}
> Here I create the corresponding `EST` timestamp of my original timestamps (in 
> `UTC` time).
> Now saving the dataframe to `csv` or to `feather` will generate two 
> completely different results.
>  
> {code:java}
> df.to_csv('P://testing.csv')
> df.to_feather('P://testing.feather')
> {code}
> Switching to R.
> Using the good old `csv` gives me something a bit annoying, but expected. R 
> thinks my timezone is `UTC` by default, and wrongly attached this timezone to 
> `timestamp_est`. No big deal, I can always use `with_tz` or even better: 
> import as character and process as timestamp while in R.
>  
> {code:java}
> > dataframe <- read_csv('P://testing.csv')
>  Parsed with column specification:
>  cols(
>  X1 = col_integer(),
>  string_time_utc = col_datetime(format = ""),
>  timestamp_est = col_datetime(format = "")
>  )
>  Warning message:
>  Missing column names filled in: 'X1' [1] 
>  > 
>  > dataframe %>% mutate(mytimezone = tz(timestamp_est))
> # A tibble: 3 x 4
>      X1 string_time_utc         timestamp_est           mytimezone
>   <int> <dttm>                  <dttm>                  <chr>
> 1     0 2018-02-01 14:00:00.530 2018-02-01 09:00:00.530 UTC
> 2     1 2018-02-01 14:01:00.456 2018-02-01 09:01:00.456 UTC
> 3     2 2018-03-05 14:01:02.200 2018-03-05 09:01:02.200 UTC
> {code}
> {code:java}
> #Now look at what happens with feather:
>  
>  > dataframe <- read_feather('P://testing.feather')
>  > 
>  > dataframe %>% mutate(mytimezone = tz(timestamp_est))
> # A tibble: 3 x 3
>   string_time_utc         timestamp_est           mytimezone
>   <dttm>                  <dttm>                  <chr>
> 1 2018-02-01 09:00:00.531 2018-02-01 04:00:00.531 ""
> 2 2018-02-01 09:01:00.456 2018-02-01 04:01:00.456 ""
> 3 2018-03-05 09:01:02.200 2018-03-05 04:01:02.200 ""
> {code}
> My timestamps have been converted!!! pure insanity. 
>  Am I missing something here?
> Thanks!!



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6561) [Python] pandas-master integration test failure

2019-09-14 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6561:
---

 Summary: [Python] pandas-master integration test failure
 Key: ARROW-6561
 URL: https://issues.apache.org/jira/browse/ARROW-6561
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Wes McKinney
 Fix For: 0.15.0


{code}
=== FAILURES ===
_ test_array_protocol __

def test_array_protocol():
if LooseVersion(pd.__version__) < '0.24.0':
pytest.skip('IntegerArray only introduced in 0.24')

def __arrow_array__(self, type=None):
return pa.array(self._data, mask=self._mask, type=type)

df = pd.DataFrame({'a': pd.Series([1, 2, None], dtype='Int64')})

# with latest pandas/arrow, trying to convert nullable integer errors
with pytest.raises(TypeError):
>   pa.table(df)
E   Failed: DID NOT RAISE <class 'TypeError'>

opt/conda/lib/python3.6/site-packages/pyarrow/tests/test_pandas.py:3035: Failed
{code}

https://circleci.com/gh/ursa-labs/crossbow/2896
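
For context, a toy example of the __arrow_array__ protocol this test exercises (the class below is an illustration, not pandas' IntegerArray): pa.array() and the pandas conversion path call this hook on objects that define it, and the "DID NOT RAISE" failure is consistent with pandas master now supporting that conversion.

{code:python}
import numpy as np
import pyarrow as pa


class MaskedInts:
    """Toy container implementing the __arrow_array__ protocol."""

    def __init__(self, values, mask):
        self.values = np.asarray(values)
        self.mask = np.asarray(mask)   # True marks a null slot

    def __arrow_array__(self, type=None):
        # pyarrow calls this hook instead of guessing a conversion.
        return pa.array(self.values, mask=self.mask, type=type)


print(pa.array(MaskedInts([1, 2, 0], [False, False, True])))  # [1, 2, null]
{code}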



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6560) [Python] Failures in *-nopandas integration tests

2019-09-14 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6560:
---

 Summary: [Python] Failures in *-nopandas integration tests
 Key: ARROW-6560
 URL: https://issues.apache.org/jira/browse/ARROW-6560
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Wes McKinney
 Fix For: 0.15.0






--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Assigned] (ARROW-6509) [C++][Gandiva] Re-enable Gandiva JNI tests and fix Travis CI failure

2019-09-14 Thread Pindikura Ravindra (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pindikura Ravindra reassigned ARROW-6509:
-

Assignee: Prudhvi Porandla

> [C++][Gandiva] Re-enable Gandiva JNI tests and fix Travis CI failure
> 
>
> Key: ARROW-6509
> URL: https://issues.apache.org/jira/browse/ARROW-6509
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration, Java
>Reporter: Antoine Pitrou
>Assignee: Prudhvi Porandla
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> This seems to happen more or less frequently on the Python - Java build (with 
> jpype enabled).
> See warnings and errors starting from 
> https://travis-ci.org/apache/arrow/jobs/583069089#L6662



--
This message was sent by Atlassian Jira
(v8.3.2#803003)