[jira] [Closed] (ARROW-4914) [Rust] Array slice returns incorrect bitmask
[ https://issues.apache.org/jira/browse/ARROW-4914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neville Dipale closed ARROW-4914.
---------------------------------
Resolution: Resolved

> [Rust] Array slice returns incorrect bitmask
>
> Key: ARROW-4914
> URL: https://issues.apache.org/jira/browse/ARROW-4914
> Project: Apache Arrow
> Issue Type: Bug
> Components: Rust
> Affects Versions: 0.13.0
> Reporter: Neville Dipale
> Priority: Blocker
> Labels: beginner
>
> Slicing arrays changes the offset, length and null count of their array data,
> but the bitmask is not changed.
> This results in the correct null count, but the array values might be marked
> incorrectly as valid/invalid based on the old bitmask positions before the
> offset.
> To reproduce, create an array with some null values, slice the array, and
> then dbg!() it (after downcasting).

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
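The offset/bitmask relationship behind this bug can be sketched with a small standalone model (plain Python, not the Rust arrow crate API; names are illustrative): a slice shares its parent's validity bitmap and records an offset, so validity for slice index i must be read from bit (offset + i). Reading from bit 0, as the report describes, misattributes validity bits that precede the offset.

```python
# Illustrative model of an Arrow validity bitmap (LSB-first within each byte,
# as in the Arrow format). Not the arrow Rust crate's actual API.

def get_bit(bitmap: bytes, i: int) -> bool:
    """Return validity bit i: 1 means valid, 0 means null."""
    return (bitmap[i // 8] >> (i % 8)) & 1 == 1

# Parent array of 8 values; indices 0 and 3 are null, the rest valid.
# LSB-first: bits 1,2,4,5,6,7 set -> 0b11110110 = 0xF6
bitmap = bytes([0xF6])

offset, length = 2, 4  # slice covering parent indices 2..5

# Correct: apply the slice offset when reading the shared bitmap.
correct = [get_bit(bitmap, offset + i) for i in range(length)]

# The reported buggy behaviour: reading from bit 0 of the old bitmap.
buggy = [get_bit(bitmap, i) for i in range(length)]

print(correct)  # [True, False, True, True]  (parent indices 2,3,4,5)
print(buggy)    # [False, True, True, False] (parent indices 0,1,2,3)
```

Note that both readings contain the same number of nulls, which matches the symptom in the report: the null count looks right while individual values are marked valid/invalid incorrectly.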
[jira] [Commented] (ARROW-3543) [R] Better support for timestamp format and time zones in R
[ https://issues.apache.org/jira/browse/ARROW-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16929846#comment-16929846 ]

Olaf commented on ARROW-3543:
-----------------------------
OP here. So we're finally close to having this fixed? Amazing!

> [R] Better support for timestamp format and time zones in R
>
> Key: ARROW-3543
> URL: https://issues.apache.org/jira/browse/ARROW-3543
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Reporter: Olaf
> Priority: Major
> Fix For: 1.0.0
>
> See below for original description and reports. In sum, there is a mismatch
> between how the C++ library and R interpret data without a timezone, and it
> turns out that we're not passing the timezone to R if it is set in Arrow C++
> anyway.
> The [C++ library docs|http://arrow.apache.org/docs/cpp/api/datatype.html#_CPPv4N5arrow13TimestampTypeE]
> say "If a timezone-aware field contains a recognized timezone, its values
> may be localized to that locale upon display; the values of timezone-naive
> fields must always be displayed "as is", with no localization performed on
> them." But R's print default, as well as the parsing default, is the current
> time zone: https://stat.ethz.ch/R-manual/R-devel/library/base/html/strptime.html
> The C++ library seems to parse timestamp strings that don't have timezone
> information as if they are UTC, so when you read timezone-naive timestamps
> from Arrow and print them in R, they are shifted to be localized to the
> current timezone. If you print timestamp data from Arrow with
> {{print(timestamp_var, tz="GMT")}} it would look as you expect.
> On further inspection, the [arrow-to-vector code for timestamp|https://github.com/apache/arrow/blob/master/r/src/array_to_vector.cpp#L504-L514]
> doesn't seem to consider time zone information even if it does exist. So we
> don't have the means currently in R to display timestamp data faithfully,
> whether or not it is timezone-aware.
> Among the tasks here:
> * Include the timezone attribute in the POSIXct R vector that gets created
> from a timestamp Arrow array
> * Ensure that timezone-naive data from Arrow is printed in R "as is" with no
> localization
> ----
> Original description:
> Hello the dream team,
> Pasting from [https://github.com/wesm/feather/issues/351]
> Thanks for this wonderful package. I was playing with feather and some
> timestamps and I noticed some dangerous behavior. Maybe it is a bug.
> Consider this:
>
> {code:java}
> import pandas as pd
> import feather
> import numpy as np
> df = pd.DataFrame(
>     {'string_time_utc' : [pd.to_datetime('2018-02-01 14:00:00.531'),
>      pd.to_datetime('2018-02-01 14:01:00.456'),
>      pd.to_datetime('2018-03-05 14:01:02.200')]}
> )
> df['timestamp_est'] = pd.to_datetime(df.string_time_utc).dt.tz_localize('UTC').dt.tz_convert('US/Eastern').dt.tz_localize(None)
> df
> Out[17]:
>          string_time_utc           timestamp_est
> 0 2018-02-01 14:00:00.531 2018-02-01 09:00:00.531
> 1 2018-02-01 14:01:00.456 2018-02-01 09:01:00.456
> 2 2018-03-05 14:01:02.200 2018-03-05 09:01:02.200
> {code}
> Here I create the corresponding `EST` timestamp of my original timestamps (in
> `UTC` time). Now saving the dataframe to `csv` or to `feather` will generate
> two completely different results.
>
> {code:java}
> df.to_csv('P://testing.csv')
> df.to_feather('P://testing.feather')
> {code}
> Switching to R. Using the good old `csv` gives me something a bit annoying,
> but expected: R thinks my timezone is `UTC` by default, and wrongly attached
> this timezone to `timestamp_est`. No big deal, I can always use `with_tz` or,
> even better, import as character and process as timestamp while in R.
>
> {code:java}
> > dataframe <- read_csv('P://testing.csv')
> Parsed with column specification:
> cols(
>   X1 = col_integer(),
>   string_time_utc = col_datetime(format = ""),
>   timestamp_est = col_datetime(format = "")
> )
> Warning message:
> Missing column names filled in: 'X1' [1]
> > dataframe %>% mutate(mytimezone = tz(timestamp_est))
> A tibble: 3 x 4
>      X1         string_time_utc           timestamp_est
> 1     0 2018-02-01 14:00:00.530 2018-02-01 09:00:00.530
> 2     1 2018-02-01 14:01:00.456 2018-02-01 09:01:00.456
> 3     2 2018-03-05 14:01:02.200 2018-03-05 09:01:02.200
>   mytimezone
> 1        UTC
> 2        UTC
> 3        UTC
> {code}
> {code:java}
> # Now look at what happens with feather:
> > dataframe <- read_feather('P://testing.feather')
> > dataframe %>% mutate(mytimezone = tz(timestamp_est))
> A tibble: 3 x 3
>          string_time_utc           timestamp_est mytimezone
> 1 2018-02-01 09:00:00.531 2018-02-01 04:00:00.531 ""
> 2 2018-02-01 09:01:00.456 2018-02-01 04:01:00.456 ""
> 3 2018-03-05 09:01:02.200 2018-03-05 04:01:02.200 ""
> {code}
> My timestamps have been converted!!! Pure insanity.
> Am I missing something here?
> Thanks!!
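The core mismatch in this issue (one system treating timezone-naive values as UTC, another displaying them localized to a different zone) can be sketched with the Python standard library. This is illustrative only, not Arrow's or readr's actual code path, and the UTC-5 offset is a stand-in for US/Eastern standard time:

```python
from datetime import datetime, timezone, timedelta

# A timezone-naive timestamp, as it would be stored in the Feather file.
naive = datetime(2018, 2, 1, 14, 0, 0)

# Reader A treats naive values as UTC (as Arrow C++ does when parsing
# timestamp strings that carry no zone information).
as_utc = naive.replace(tzinfo=timezone.utc)

# Reader B displays the same instant localized to another zone, e.g.
# US/Eastern in winter (UTC-5), shifting the wall clock by five hours.
eastern = timezone(timedelta(hours=-5), name="EST")
shifted = as_utc.astimezone(eastern)

print(naive.isoformat())    # 2018-02-01T14:00:00         ("as is")
print(shifted.isoformat())  # 2018-02-01T09:00:00-05:00   (localized display)
```

The two printed values refer to the same instant; only the display differs. That is exactly why the feather round trip above looks like the timestamps "have been converted": the underlying values are intact, but each reader applies a different display convention to naive data.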
[jira] [Updated] (ARROW-6559) [Developer][C++] Add "archery" option to specify system toolchain for C++ builds
[ https://issues.apache.org/jira/browse/ARROW-6559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-6559:
----------------------------------
Labels: pull-request-available (was: )

> [Developer][C++] Add "archery" option to specify system toolchain for C++ builds
>
> Key: ARROW-6559
> URL: https://issues.apache.org/jira/browse/ARROW-6559
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++, Developer Tools
> Reporter: Wes McKinney
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.15.0
>
> I use toolchain directories that are found outside of conda environments.
> It's a bit awkward to use archery to do benchmark comparisons with this
> arrangement. I suggest adding a "--cpp-package-prefix" option or similar that
> will set {{ARROW_DEPENDENCY_SOURCE=SYSTEM}} and the correct
> {{ARROW_PACKAGE_PREFIX}} so this works properly.
[jira] [Resolved] (ARROW-1741) [C++] Comparison function for DictionaryArray to determine if indices are "compatible"
[ https://issues.apache.org/jira/browse/ARROW-1741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney resolved ARROW-1741.
---------------------------------
Resolution: Fixed

Issue resolved by pull request 5342
[https://github.com/apache/arrow/pull/5342]

> [C++] Comparison function for DictionaryArray to determine if indices are "compatible"
>
> Key: ARROW-1741
> URL: https://issues.apache.org/jira/browse/ARROW-1741
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++
> Reporter: Wes McKinney
> Assignee: Benjamin Kietzman
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.15.0
> Time Spent: 2h 20m
> Remaining Estimate: 0h
>
> For example, if one array's dictionary is larger than the other, but the
> overlapping beginning portion is the same, then the respective dictionary
> indices correspond to the same values. Therefore, in analytics, one may
> choose to drop the smaller dictionary in favor of the larger dictionary, and
> this need not incur any computational overhead (beyond comparing the
> dictionary prefixes -- there may be some way to engineer "dictionary lineage"
> to make this comparison even cheaper).
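The "compatible indices" notion described here reduces to a prefix check: if the shorter dictionary equals the beginning of the longer one, every index refers to the same value in both. A minimal sketch (plain Python with illustrative names, not the C++ API merged in the PR):

```python
def dictionaries_compatible(dict_a, dict_b):
    """True when the shorter dictionary is a prefix of the longer one,
    so indices into either dictionary denote the same values."""
    short, long_ = (dict_a, dict_b) if len(dict_a) <= len(dict_b) else (dict_b, dict_a)
    return long_[:len(short)] == short

# Overlapping beginning portion is identical: index i means the same value
# in both, so analytics can keep only the larger dictionary.
print(dictionaries_compatible(["a", "b"], ["a", "b", "c"]))  # True

# Values diverge within the overlap: indices are not interchangeable.
print(dictionaries_compatible(["a", "x"], ["a", "b", "c"]))  # False
```

The check costs O(len(shorter)); the "dictionary lineage" idea mentioned in the description would aim to make even that comparison unnecessary when one dictionary is known to extend the other.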
[jira] [Updated] (ARROW-3543) [R] Better support for timestamp format and time zones in R
[ https://issues.apache.org/jira/browse/ARROW-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neal Richardson updated ARROW-3543:
-----------------------------------
Description: See below for original description and reports. In sum, there is a
mismatch between how the C++ library and R interpret data without a timezone,
and it turns out that we're not passing the timezone to R if it is set in Arrow
C++ anyway. (The full updated description is quoted in the comment above.)
[jira] [Updated] (ARROW-3543) [R] Better support for timestamp format and time zones in R
[ https://issues.apache.org/jira/browse/ARROW-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neal Richardson updated ARROW-3543:
-----------------------------------
Description: See below for original description and reports. In sum, there is a
mismatch between how the C++ library and R interpret data without a timezone,
and it turns out that we're not passing the timezone to R if it is set in Arrow
C++ anyway.

The C++ library docs say "If a timezone-aware field contains a recognized
timezone, its values may be localized to that locale upon display; the values
of timezone-naive fields must always be displayed "as is", with no localization
performed on them." But R's print default is the current time zone:
https://stat.ethz.ch/R-manual/R-devel/library/base/html/strptime.html

My guess is that readr::read_delim interprets timestamps without a timezone to
be the current time zone, but Arrow C++ interprets that as UTC, which becomes a
problem when R tries to print the timestamp. I'm guessing that if you did
print(df$Date, tz="GMT") it would look as you expect.

Other fun fact I saw while digging in: the arrow-to-vector code for timestamp
doesn't seem to consider time zone information if it does exist, so we should
handle that too.
https://github.com/apache/arrow/blob/master/r/src/array_to_vector.cpp#L504-L514
[jira] [Updated] (ARROW-3543) [R] Better support for timestamp format and time zones in R
[ https://issues.apache.org/jira/browse/ARROW-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neal Richardson updated ARROW-3543:
-----------------------------------
Summary: [R] Better support for timestamp format and time zones in R (was: [R]
Time zone adjustment issue when reading Feather file written by Python)
[jira] [Commented] (ARROW-4208) [CI/Python] Have automatized tests for S3
[ https://issues.apache.org/jira/browse/ARROW-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16929826#comment-16929826 ]

Rok Mihevc commented on ARROW-4208:
-----------------------------------
With PR [5200|https://github.com/apache/arrow/pull/5200] we get a minio server
in pytest via a fixture. It runs in the docker images, Travis and AppVeyor. So
far we have one [test|https://github.com/apache/arrow/blob/59f1e148d5c0fa13b7964f85f13011532ff515ed/python/pyarrow/tests/test_parquet.py#L1797]
in Python. Do we want to add other tests now? Do we have regression examples?

> [CI/Python] Have automatized tests for S3
>
> Key: ARROW-4208
> URL: https://issues.apache.org/jira/browse/ARROW-4208
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Continuous Integration, Python
> Reporter: Krisztian Szucs
> Assignee: Rok Mihevc
> Priority: Major
> Labels: filesystem, pull-request-available, s3
> Fix For: 1.0.0
>
> Currently we don't run S3 integration tests regularly.
> Possible solutions:
> - mock it within python/pytest
> - simply run the S3 tests with an S3 credential provided
> - create a hdfs-integration-like docker-compose setup and run an S3 mock
> server (e.g.: https://github.com/adobe/S3Mock,
> https://github.com/jubos/fake-s3, https://github.com/gaul/s3proxy,
> https://github.com/jserver/mock-s3)
> For more see discussion https://github.com/apache/arrow/pull/3286
[jira] [Updated] (ARROW-6562) [GLib] Fix wrong sliced data of GArrowBuffer
[ https://issues.apache.org/jira/browse/ARROW-6562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-6562:
----------------------------------
Labels: pull-request-available (was: )

> [GLib] Fix wrong sliced data of GArrowBuffer
>
> Key: ARROW-6562
> URL: https://issues.apache.org/jira/browse/ARROW-6562
> Project: Apache Arrow
> Issue Type: Bug
> Reporter: Sutou Kouhei
> Assignee: Sutou Kouhei
> Priority: Major
> Labels: pull-request-available
[jira] [Created] (ARROW-6562) [GLib] Fix wrong sliced data of GArrowBuffer
Sutou Kouhei created ARROW-6562:
--------------------------------
Summary: [GLib] Fix wrong sliced data of GArrowBuffer
Key: ARROW-6562
URL: https://issues.apache.org/jira/browse/ARROW-6562
Project: Apache Arrow
Issue Type: Bug
Reporter: Sutou Kouhei
Assignee: Sutou Kouhei
[jira] [Updated] (ARROW-4208) [CI/Python] Have automatized tests for S3
[ https://issues.apache.org/jira/browse/ARROW-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rok Mihevc updated ARROW-4208:
------------------------------
Labels: filesystem pull-request-available s3 (was: filesystem s3)
[jira] [Updated] (ARROW-6090) [Rust] [DataFusion] Implement parallel execution for hash aggregate
[ https://issues.apache.org/jira/browse/ARROW-6090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andy Grove updated ARROW-6090:
------------------------------
Fix Version/s: 0.15.0

> [Rust] [DataFusion] Implement parallel execution for hash aggregate
>
> Key: ARROW-6090
> URL: https://issues.apache.org/jira/browse/ARROW-6090
> Project: Apache Arrow
> Issue Type: Sub-task
> Components: Rust - DataFusion
> Reporter: Andy Grove
> Assignee: Andy Grove
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.15.0
> Time Spent: 20m
> Remaining Estimate: 0h
[jira] [Commented] (ARROW-3543) [R] Time zone adjustment issue when reading Feather file written by Python
[ https://issues.apache.org/jira/browse/ARROW-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16929781#comment-16929781 ]

Shannon C Lewis commented on ARROW-3543:
----------------------------------------
Hey Neal, yep, when I run print(arrow_log$DateTime, tz="GMT") I'm getting the
correct datetime:

> print(arrow_log$DateTime,tz="GMT")
[1] "2019-09-11 21:36:22 GMT" "2019-09-11 21:36:22 GMT" "2019-09-11 22:43:58 GMT"
    "2019-09-11 22:43:58 GMT" "2019-09-11 23:11:39 GMT" "2019-09-12 00:36:22 GMT"
[7] "2019-09-12 00:36:22 GMT" "2019-09-12 00:43:58 GMT" "2019-09-12 00:43:58 GMT"
    "2019-09-12 01:11:39 GMT"

I also tested this with saving with Python to feather and reading with both
feather and arrow:

> print(py_feather$DateTime,tz="GMT")
[1] "2019-09-11 21:36:22 GMT" "2019-09-11 21:36:22 GMT" "2019-09-11 22:43:58 GMT"
    "2019-09-11 22:43:58 GMT" "2019-09-11 23:11:39 GMT" "2019-09-12 00:36:22 GMT"
[7] "2019-09-12 00:36:22 GMT" "2019-09-12 00:43:58 GMT" "2019-09-12 00:43:58 GMT"
    "2019-09-12 01:11:39 GMT"

So it seems like you have identified the issue :)
[jira] [Created] (ARROW-6561) [Python] pandas-master integration test failure
Wes McKinney created ARROW-6561:
--------------------------------
Summary: [Python] pandas-master integration test failure
Key: ARROW-6561
URL: https://issues.apache.org/jira/browse/ARROW-6561
Project: Apache Arrow
Issue Type: Bug
Components: Python
Reporter: Wes McKinney
Fix For: 0.15.0

{code}
=== FAILURES ===
_ test_array_protocol __

    def test_array_protocol():
        if LooseVersion(pd.__version__) < '0.24.0':
            pytest.skip('IntegerArray only introduced in 0.24')

        def __arrow_array__(self, type=None):
            return pa.array(self._data, mask=self._mask, type=type)

        df = pd.DataFrame({'a': pd.Series([1, 2, None], dtype='Int64')})
        # with latest pandas/arrow, trying to convert nullable integer errors
        with pytest.raises(TypeError):
>           pa.table(df)
E           Failed: DID NOT RAISE

opt/conda/lib/python3.6/site-packages/pyarrow/tests/test_pandas.py:3035: Failed
{code}

https://circleci.com/gh/ursa-labs/crossbow/2896?utm_campaign=vcs-integration-link_medium=referral_source=github-build-link
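The failing test exercises pandas' `__arrow_array__` conversion hook: a consumer converting a column checks for the method and, if present, delegates to it. The dispatch idea can be sketched without pyarrow (class and function names here are illustrative, not pandas' or pyarrow's actual machinery):

```python
# Minimal model of a consumer honoring an __arrow_array__-style hook
# (illustrative only; not pyarrow's conversion code).

class NullableIntColumn:
    """A toy masked-integer column, standing in for pandas' IntegerArray."""
    def __init__(self, data, mask):
        self._data, self._mask = data, mask

    def __arrow_array__(self, type=None):
        # Produce values with masked-out entries replaced by None.
        return [None if m else v for v, m in zip(self._data, self._mask)]

def to_array(obj, type=None):
    """Consumer side: prefer the object's own conversion hook when present."""
    hook = getattr(obj, "__arrow_array__", None)
    if hook is not None:
        return hook(type=type)
    return list(obj)

col = NullableIntColumn([1, 2, 0], [False, False, True])
print(to_array(col))     # [1, 2, None]  (hook path)
print(to_array([3, 4]))  # [3, 4]        (fallback path)
```

The test failure above is the mirror image of this protocol's evolution: an older expectation that converting a nullable-integer column raises TypeError stopped holding once the hook path began succeeding.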
[jira] [Created] (ARROW-6560) [Python] Failures in *-nopandas integration tests
Wes McKinney created ARROW-6560:
--------------------------------
Summary: [Python] Failures in *-nopandas integration tests
Key: ARROW-6560
URL: https://issues.apache.org/jira/browse/ARROW-6560
Project: Apache Arrow
Issue Type: Bug
Components: Python
Reporter: Wes McKinney
Fix For: 0.15.0
[jira] [Assigned] (ARROW-6509) [C++][Gandiva] Re-enable Gandiva JNI tests and fix Travis CI failure
[ https://issues.apache.org/jira/browse/ARROW-6509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pindikura Ravindra reassigned ARROW-6509:
-----------------------------------------
Assignee: Prudhvi Porandla

> [C++][Gandiva] Re-enable Gandiva JNI tests and fix Travis CI failure
>
> Key: ARROW-6509
> URL: https://issues.apache.org/jira/browse/ARROW-6509
> Project: Apache Arrow
> Issue Type: Bug
> Components: Continuous Integration, Java
> Reporter: Antoine Pitrou
> Assignee: Prudhvi Porandla
> Priority: Critical
> Labels: pull-request-available
> Fix For: 0.15.0
> Time Spent: 2h 50m
> Remaining Estimate: 0h
>
> This seems to happen more or less frequently on the Python - Java build (with
> jpype enabled).
> See warnings and errors starting from
> https://travis-ci.org/apache/arrow/jobs/583069089#L6662