Re: [I] SQL: Ambiguous reference when aliasing in combination with ORDER BY [arrow-datafusion]

2023-12-04 Thread via GitHub


Asura7969 commented on issue #8391:
URL: 
https://github.com/apache/arrow-datafusion/issues/8391#issuecomment-1840186240

   Do you have any plans to fix this bug? If you don't have time, I'll follow up. @Jesse-Bakker


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] GH-38923: [GLib] Fix spelling [arrow]

2023-12-04 Thread via GitHub


kou commented on code in PR #38924:
URL: https://github.com/apache/arrow/pull/38924#discussion_r1415033694


##
c_glib/test/test-array.rb:
##
@@ -141,12 +141,12 @@ def test_no_diff
 
 def test_diff
   array = build_string_array(["Start", "Shutdown", "Reboot"])
-  other_array = build_string_array(["Start", "Shutdonw", "Reboot"])
+  other_array = build_string_array(["Start", "Shutdown_", "Reboot"])
   assert_equal(<<-STRING.chomp, array.diff_unified(other_array))
 
 @@ -1, +1 @@
 -"Shutdown"
-+"Shutdonw"
++"Shutdown_"

Review Comment:
   I took a look at this test again.
   We should not change this because it's a test that detects the difference between two arrays. The "Shutdonw" typo is intentional: it is what produces the difference.
   
   Or we could use an entirely different word, something like:
   
   ```diff
   diff --git a/c_glib/test/test-array.rb b/c_glib/test/test-array.rb
   index c03aecf17..4da641b20 100644
   --- a/c_glib/test/test-array.rb
   +++ b/c_glib/test/test-array.rb
   @@ -141,12 +141,12 @@ class TestArray < Test::Unit::TestCase

def test_diff
  array = build_string_array(["Start", "Shutdown", "Reboot"])
   -  other_array = build_string_array(["Start", "Shutdonw", "Reboot"])
   +  other_array = build_string_array(["Start", "Running", "Reboot"])
  assert_equal(<<-STRING.chomp, array.diff_unified(other_array))

@@ -1, +1 @@
-"Shutdown"
   -+"Shutdonw"
   ++"Running"

  STRING
end
   ```






Re: [I] [C++][FS][Azure] Should `GetFileInfo()` against a directory always return true without hierarchical namespace support? [arrow]

2023-12-04 Thread via GitHub


Tom-Newton commented on issue #38772:
URL: https://github.com/apache/arrow/issues/38772#issuecomment-1840166133

   > @felipecrv @Tom-Newton Do either of you want to work on this?
   
   I might be able to do it in about a week





Re: [I] [Java][FlightSQL] In authentication, how does flight server get the IP/Hostname of ADBC client [arrow]

2023-12-04 Thread via GitHub


xinyiZzz commented on issue #38911:
URL: https://github.com/apache/arrow/issues/38911#issuecomment-1840114848

   I'm hoping for some help, thanks!





Re: [I] create s3 failed [arrow-datafusion]

2023-12-04 Thread via GitHub


Jefffrey commented on issue #8421:
URL: 
https://github.com/apache/arrow-datafusion/issues/8421#issuecomment-1840107703

   Looks to be a regression introduced by 
https://github.com/apache/arrow-datafusion/pull/7390
   
   Specifically:
   
   
https://github.com/apache/arrow-datafusion/blob/d1554c85c0da2342c5247ed22a728617e8f69142/datafusion/common/src/file_options/parquet_writer.rs#L208
   
   We could probably relax this strict check to allow unknown keys, since they may be used for other purposes, such as S3 object store configuration.
   
   The same could potentially be done for JSON and CSV too:
   
   
https://github.com/apache/arrow-datafusion/blob/d1554c85c0da2342c5247ed22a728617e8f69142/datafusion/common/src/file_options/json_writer.rs#L54
   
   
https://github.com/apache/arrow-datafusion/blob/d1554c85c0da2342c5247ed22a728617e8f69142/datafusion/common/src/file_options/csv_writer.rs#L100
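The relaxation described above could be sketched as follows. This is an illustrative standalone function, not DataFusion's actual `parquet_writer.rs` code; `parse_parquet_options` and the known-key list are made up for the example:

```rust
use std::collections::HashMap;

/// Hypothetical writer-option parser: instead of erroring on the first
/// unknown key, split the options into the keys this format understands
/// and the rest (e.g. object-store settings like `access_key_id`), which
/// can then be consumed elsewhere.
fn parse_parquet_options(
    options: &HashMap<String, String>,
) -> (HashMap<String, String>, HashMap<String, String>) {
    // Illustrative subset of format-specific keys.
    const KNOWN_KEYS: &[&str] = &["compression", "max_row_group_size"];
    let mut known = HashMap::new();
    let mut unknown = HashMap::new();
    for (key, value) in options {
        if KNOWN_KEYS.contains(&key.as_str()) {
            known.insert(key.clone(), value.clone());
        } else {
            // Previously this branch would return an error; relaxing it
            // leaves the key available for other consumers.
            unknown.insert(key.clone(), value.clone());
        }
    }
    (known, unknown)
}

fn main() {
    let mut opts = HashMap::new();
    opts.insert("compression".to_string(), "zstd".to_string());
    opts.insert("access_key_id".to_string(), "xyz".to_string());
    let (known, unknown) = parse_parquet_options(&opts);
    println!("{} known, {} unknown", known.len(), unknown.len());
}
```

The key design point is that unrecognized keys are carried through rather than rejected, so the same `OPTIONS(...)` clause can mix format options and object-store credentials.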





Re: [I] Failed to build docker image [arrow-ballista]

2023-12-04 Thread via GitHub


monkeymq commented on issue #912:
URL: https://github.com/apache/arrow-ballista/issues/912#issuecomment-1840091400

   @zhangyuan Thank you for your suggestion; it solved part of my problem. I say this because I also needed to set permissions on some scripts before performing any steps (after `git clone`, of course), i.e. `chmod a+rwx ./dev/docker/*.sh`; this works for me.
   
   BTW, the documentation for Ballista is a bit old and incorrect, which puts some people off.





Re: [PR] GH-39050: [C++] Use Cast() instead of CastTo() for Timestamp Scalar in test [arrow]

2023-12-04 Thread via GitHub


conbench-apache-arrow[bot] commented on PR #39060:
URL: https://github.com/apache/arrow/pull/39060#issuecomment-1840053073

   After merging your PR, Conbench analyzed the 6 benchmarking runs that have 
been run so far on merge-commit c39a2230b931a009c34e2a4885b453302c02b233.
   
   There were no benchmark performance regressions. 🎉
   
   The [full Conbench report](https://github.com/apache/arrow/runs/19315635400) 
has more details. It also includes information about 1 possible false positive 
for unstable benchmarks that are known to sometimes produce them.





Re: [I] String compare for Substrait roundtrip check [arrow-datafusion]

2023-12-04 Thread via GitHub


tgujar commented on issue #8361:
URL: 
https://github.com/apache/arrow-datafusion/issues/8361#issuecomment-1840047758

   Plan 1 is the produced logical plan, and Plan 2 is Plan 1 converted to Substrait and then converted back to a logical plan. I think the issue arises because of differences between what the Substrait grammar can express and the logical plan generated by DataFusion. For tests where the plans differ, I think using 
[assert_expected_plan](https://github.com/apache/arrow-datafusion/blob/aeb012e78115bf69d9a58407ac24bc20bd9e0bf0/datafusion/substrait/tests/cases/roundtrip_logical_plan.rs#L781C10-L781C30)
 instead of 
[roundtrip_with_ctx](https://github.com/apache/arrow-datafusion/blob/aeb012e78115bf69d9a58407ac24bc20bd9e0bf0/datafusion/substrait/tests/cases/roundtrip_logical_plan.rs#L832)
 should work after manual inspection. I want to look into this further, but maybe we could normalize both plans somehow and then check for equality.
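The normalization idea at the end could be sketched like this. It operates on the rendered plan text rather than on `LogicalPlan` nodes, and `normalize_plan_text`/`plans_equal` are hypothetical helpers, not DataFusion APIs:

```rust
/// Collapse cosmetic differences (indentation, trailing whitespace,
/// blank lines) in a rendered plan so only structural differences remain.
fn normalize_plan_text(plan: &str) -> String {
    plan.lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .collect::<Vec<_>>()
        .join("\n")
}

/// Compare two rendered plans after normalization.
fn plans_equal(plan1: &str, plan2: &str) -> bool {
    normalize_plan_text(plan1) == normalize_plan_text(plan2)
}

fn main() {
    // Same plan, rendered with different indentation.
    let p1 = "Projection: a\n  TableScan: t\n";
    let p2 = "Projection: a\n    TableScan: t";
    assert!(plans_equal(p1, p2));
    println!("plans match after normalization");
}
```

A real fix would likely normalize the plan trees themselves (aliases, qualifier styles) before comparison; text normalization only handles the cosmetic layer.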





Re: [PR] GH-38705: [C++][FS][Azure] Implement CopyFile() [arrow]

2023-12-04 Thread via GitHub


kou commented on PR #39058:
URL: https://github.com/apache/arrow/pull/39058#issuecomment-1840039735

   Updated:
   * Accept only a blob path for the destination, like other implementations do.





Re: [PR] Bump actions/labeler from 4.3.0 to 5.0.0 [arrow-ballista]

2023-12-04 Thread via GitHub


dependabot[bot] commented on PR #923:
URL: https://github.com/apache/arrow-ballista/pull/923#issuecomment-1840015487

   The following labels could not be found: `auto-dependencies`.





[PR] Bump actions/labeler from 4.3.0 to 5.0.0 [arrow-ballista]

2023-12-04 Thread via GitHub


dependabot[bot] opened a new pull request, #923:
URL: https://github.com/apache/arrow-ballista/pull/923

   Bumps [actions/labeler](https://github.com/actions/labeler) from 4.3.0 to 5.0.0.
   
   Release notes
   Sourced from [actions/labeler's releases](https://github.com/actions/labeler/releases).
   
   v5.0.0
   What's Changed
   This release contains the following breaking changes:
   
   - The ability to apply labels based on the names of base and/or head branches was added ([#186](https://redirect.github.com/actions/labeler/issues/186) and [#54](https://redirect.github.com/actions/labeler/issues/54)). The match object for changed files was expanded with new combinations in order to make it more intuitive and flexible ([#423](https://redirect.github.com/actions/labeler/issues/423) and [#101](https://redirect.github.com/actions/labeler/issues/101)). As a result, the configuration file structure was significantly redesigned and is not compatible with the structure of the previous version. Please read the [action documentation](https://github.com/actions/labeler/tree/main#pull-request-labeler) to find out how to adapt your configuration files for use with the new action version.
   - The bug related to the `sync-labels` input was fixed ([#112](https://redirect.github.com/actions/labeler/issues/112)). Now the input value is read correctly.
   - By default, the `dot` input is set to `true`. Now, paths starting with a dot (e.g. `.github`) are matched by default.
   - Version 5 of this action updated the [runtime](https://docs.github.com/en/actions/creating-actions/metadata-syntax-for-github-actions#runs-for-javascript-actions) to Node.js 20. All scripts are now run with Node.js 20 instead of Node.js 16 and are affected by any breaking changes between Node.js 16 and 20.
   
   For more information, please read the [action documentation](https://github.com/actions/labeler/tree/main#pull-request-labeler).
   
   New Contributors
   
   - [@joshdales](https://github.com/joshdales) made their first contribution in [actions/labeler#203](https://redirect.github.com/actions/labeler/pull/203)
   - [@dusan-trickovic](https://github.com/dusan-trickovic) made their first contribution in [actions/labeler#626](https://redirect.github.com/actions/labeler/pull/626)
   - [@sungh0lim](https://github.com/sungh0lim) made their first contribution in [actions/labeler#630](https://redirect.github.com/actions/labeler/pull/630)
   - [@TrianguloY](https://github.com/TrianguloY) made their first contribution in [actions/labeler#629](https://redirect.github.com/actions/labeler/pull/629)
   
   Full Changelog: https://github.com/actions/labeler/compare/v4...v5.0.0
   
   v5.0.0-beta.1
   What's Changed
   In scope of this beta release, the structure of the configuration file (`.github/labeler.yml`) was changed from
   
   ```yaml
   LabelName:
   - any:
     - changed-files: ['list', 'of', 'globs']
     - base-branch: ['list', 'of', 'regexps']
     - head-branch: ['list', 'of', 'regexps']
   - all:
     - changed-files: ['list', 'of', 'globs']
     - base-branch: ['list', 'of', 'regexps']
     - head-branch: ['list', 'of', 'regexps']
   ```
   
   to
   
   ```yaml
   LabelName:
   - any:
     - changed-files:
       - AnyGlobToAnyFile: ['list', 'of', 'globs']
       - AnyGlobToAllFiles: ['list', 'of', 'globs']
       - AllGlobsToAnyFile: ['list', 'of', 'globs']
       - AllGlobsToAllFiles: ['list', 'of', 'globs']
     - base-branch: ['list', 'of', 'regexps']
     - head-branch: ['list', 'of', 'regexps']
   - all:
     - changed-files:
       - AnyGlobToAnyFile: ['list', 'of', 'globs']
       - AnyGlobToAllFiles: ['list', 'of', 'globs']
       - AllGlobsToAnyFile: ['list', 'of', 'globs']
   ```
   
   ... (truncated)
   
   Commits
   
   - [8558fd7](https://github.com/actions/labeler/commit/8558fd74291d67161a8a78ce36a881fa63b766a9) Merge pull request [#709](https://redirect.github.com/actions/labeler/issues/709) from actions/v5.0.0-beta
   - [000ca75](https://github.com/actions/labeler/commit/000ca75fe6c5838c790ca73b764419065c1594a6) Merge pull request [#700](https://redirect.github.com/actions/labeler/issues/700) from MaksimZhukov/apply-suggestions-and-update-docume...
   - [cb66c2f](https://github.com/actions/labeler/commit/cb66c2f0788d382da1dabd06a094c0bc6ed3e26a) Update dist
   - [9181355](https://github.com/actions/labeler/commit/9181355e36dc8e434c93ba5aaa33f699c4162f38) Apply suggestions for the beta version and update the documentation
   - [efe4c1c](https://github.com/actions/labeler/commit/efe4c1c90edf0ec238b5ee13e66e1abcbbe7446e) Merge pull request [#699](https://redirect.github.com/actions/labeler/issues/699) from MaksimZhukov/update-node-runtime-and-dependencies
   - [c0957ad](https://github.com/actions/labeler/commit/c0957ad7c30fb0638e275122d51df2330459854a) Run Prettier
   - [8dc8d18](https://github.com/actions/labeler/commit/8dc8d1842f2f3ed1cf6f4190490ad02e0a755f0c) Update Node.js version i

Re: [PR] test(csharp/test/Drivers/Snowflake): refactor snowflake test framework [arrow-adbc]

2023-12-04 Thread via GitHub


ruowan commented on PR #1323:
URL: https://github.com/apache/arrow-adbc/pull/1323#issuecomment-1840012998

   Resolved. I think we can merge this PR.





Re: [PR] Remove macro in iter_to_array for List [arrow-datafusion]

2023-12-04 Thread via GitHub


viirya commented on code in PR #8414:
URL: https://github.com/apache/arrow-datafusion/pull/8414#discussion_r1414844779


##
datafusion/common/src/scalar.rs:
##
@@ -3616,6 +3216,19 @@ mod tests {
 Some(vec![Some(4), Some(5)]),
 ]);
 assert_eq!(list_array, &expected);
+
+let scalars = build_list::(vec![
+vec![Some(1), Some(2), Some(3)],
+vec![Some(4), Some(5)],
+]);

Review Comment:
   Maybe we can add tests that contain null?
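A sketch of what such null-containing inputs might look like, using plain `Option` values rather than the PR's `build_list` helper (whose signature is not reproduced here); `count_nulls` is an illustrative function, not part of the PR:

```rust
/// Count null lists and null elements in a hypothetical list-of-lists input.
fn count_nulls(lists: &[Option<Vec<Option<i32>>>]) -> (usize, usize) {
    let null_lists = lists.iter().filter(|l| l.is_none()).count();
    let null_elements = lists
        .iter()
        .flatten() // skip the null lists, yielding only &Vec<Option<i32>>
        .map(|l| l.iter().filter(|v| v.is_none()).count())
        .sum();
    (null_lists, null_elements)
}

fn main() {
    // The two null cases the review asks to cover: a null element inside
    // a list, and an entirely null list.
    let lists = vec![
        Some(vec![Some(1), None, Some(3)]),
        None,
        Some(vec![Some(4), Some(5)]),
    ];
    assert_eq!(count_nulls(&lists), (1, 1));
    println!("ok");
}
```

Feeding inputs of this shape through the new `iter_to_array` path would exercise both the null-list and null-element branches.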






Re: [PR] Remove macro in iter_to_array for List [arrow-datafusion]

2023-12-04 Thread via GitHub


viirya commented on code in PR #8414:
URL: https://github.com/apache/arrow-datafusion/pull/8414#discussion_r1414844561


##
datafusion/common/src/scalar.rs:
##
@@ -1368,103 +1366,26 @@ impl ScalarValue {
 }};
 }
 
-macro_rules! build_array_list_primitive {
-($ARRAY_TY:ident, $SCALAR_TY:ident, $NATIVE_TYPE:ident, 
$LIST_TY:ident, $SCALAR_LIST:pat) => {{
-Ok::(Arc::new($LIST_TY::from_iter_primitive::<$ARRAY_TY, _, _>(
-scalars.into_iter().map(|x| match x{
-ScalarValue::List(arr) if matches!(x, $SCALAR_LIST) => 
{
-// `ScalarValue::List` contains a single element 
`ListArray`.
-let list_arr = as_list_array(&arr);
-if list_arr.is_null(0) {
-Ok(None)
-} else {
-let primitive_arr =
-
list_arr.values().as_primitive::<$ARRAY_TY>();
-Ok(Some(
-
primitive_arr.into_iter().collect::>>(),
-))
-}
-}
-ScalarValue::LargeList(arr) if matches!(x, 
$SCALAR_LIST) =>{
-// `ScalarValue::List` contains a single element 
`ListArray`.
-let list_arr = as_large_list_array(&arr);
-if list_arr.is_null(0) {
-Ok(None)
-} else {
-let primitive_arr =
-
list_arr.values().as_primitive::<$ARRAY_TY>();
-Ok(Some(
-
primitive_arr.into_iter().collect::>>(),
-))
-}
-}
-sv => _internal_err!(
-"Inconsistent types in ScalarValue::iter_to_array. 
\
-Expected {:?}, got {:?}",
-data_type, sv
-),
-})
-.collect::>>()?,
-)))
-}};
-}
-
-macro_rules! build_array_list_string {
-($BUILDER:ident, 
$STRING_ARRAY:ident,$LIST_BUILDER:ident,$SCALAR_LIST:pat) => {{
-let mut builder = $LIST_BUILDER::new($BUILDER::new());
-for scalar in scalars.into_iter() {
-match scalar {
-ScalarValue::List(arr) if matches!(scalar, 
$SCALAR_LIST) => {
-// `ScalarValue::List` contains a single element 
`ListArray`.
-let list_arr = as_list_array(&arr);
-
-if list_arr.is_null(0) {
-builder.append(false);
-continue;
-}
-
-let string_arr = $STRING_ARRAY(list_arr.values());
-
-for v in string_arr.iter() {
-if let Some(v) = v {
-builder.values().append_value(v);
-} else {
-builder.values().append_null();
-}
-}
-builder.append(true);
-}
-ScalarValue::LargeList(arr) if matches!(scalar, 
$SCALAR_LIST) => {
-// `ScalarValue::List` contains a single element 
`ListArray`.
-let list_arr = as_large_list_array(&arr);
-
-if list_arr.is_null(0) {
-builder.append(false);
-continue;
-}
-
-let string_arr = $STRING_ARRAY(list_arr.values());
-
-for v in string_arr.iter() {
-if let Some(v) = v {
-builder.values().append_value(v);
-} else {
-builder.values().append_null();
-}
-}
-builder.append(true);
-}
-sv => {
-return _internal_err!(
-"Inconsistent types in 
ScalarValue::iter_to_array. \
-Expected List, got {:?}",
-sv
-)
-}
-}
-}
-A

Re: [I] [C++][Gandiva] Migration JIT engine from MCJIT to LLJIT [arrow]

2023-12-04 Thread via GitHub


niyue commented on issue #37848:
URL: https://github.com/apache/arrow/issues/37848#issuecomment-1839951586

   Here is the discussion thread: 
   
   https://lists.apache.org/thread/fphzvtr1jrc069z7kv78oopgr4zrjfgl





Re: [PR] MINOR: [CI] Bump actions/setup-java from 3 to 4 [arrow]

2023-12-04 Thread via GitHub


conbench-apache-arrow[bot] commented on PR #39063:
URL: https://github.com/apache/arrow/pull/39063#issuecomment-1839944650

   After merging your PR, Conbench analyzed the 4 benchmarking runs that have 
been run so far on merge-commit 6958f218528c9cd04ce63152bffb8ff86f42bd15.
   
   There were no benchmark performance regressions. 🎉
   
   The [full Conbench report](https://github.com/apache/arrow/runs/19313381468) 
has more details.





[I] create s3 failed [arrow-datafusion]

2023-12-04 Thread via GitHub


web3creator opened a new issue, #8421:
URL: https://github.com/apache/arrow-datafusion/issues/8421

   ### Describe the bug
   
   ```
   datafusion-cli
   ❯ CREATE EXTERNAL TABLE test
   STORED AS PARQUET
   OPTIONS(
   'region' 'xxx',
   'access_key_id' '',
   'secret_access_key' ''
   )
   LOCATION 'x';
   Invalid or Unsupported Configuration: Found unsupported option access_key_id with value  for Parquet format!
   ❯ 
   ```
   
   ### To Reproduce
   
   datafusion-cli  
   ```
   datafusion-cli
   ❯ CREATE EXTERNAL TABLE test
   STORED AS PARQUET
   OPTIONS(
   'region' 'xxx',
   'access_key_id' '',
   'secret_access_key' ''
   )
   LOCATION 'x';
   Invalid or Unsupported Configuration: Found unsupported option access_key_id 
with value  for Parquet format!
   ❯ 
   ```
   
   ### Expected behavior
   
   _No response_
   
   ### Additional context
   
   _No response_





Re: [PR] feat: ScalarValue from String [arrow-datafusion]

2023-12-04 Thread via GitHub


QuenKar commented on code in PR #8411:
URL: https://github.com/apache/arrow-datafusion/pull/8411#discussion_r1414788401


##
datafusion/common/src/scalar.rs:
##
@@ -4645,6 +4651,16 @@ mod tests {
 );
 }
 
   +#[test]
   +fn test_scalar_value_from_string() {
   +let scalar = ScalarValue::from("foo");
   +assert_eq!(scalar, ScalarValue::Utf8(Some("foo".to_string())));
   +let scalar = ScalarValue::from("foo".to_string());
   +assert_eq!(scalar, ScalarValue::Utf8(Some("foo".to_string())));

Review Comment:
   I find that there is no code like `"foo".to_string().as_str()`, `some_string.as_str()`, `&some_string`, or `as_str().into()` in this usage (but maybe I didn't find all of them...)
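The conversion pattern under discussion can be reproduced in miniature; `MyScalar` below is an illustrative stand-in, not DataFusion's real `ScalarValue`. With both `From<&str>` and `From<String>` implemented, neither call site needs an explicit `.as_str()`:

```rust
#[derive(Debug, PartialEq)]
enum MyScalar {
    Utf8(Option<String>),
}

// Conversion from a string literal (borrowed string slice).
impl From<&str> for MyScalar {
    fn from(value: &str) -> Self {
        MyScalar::Utf8(Some(value.to_string()))
    }
}

// Conversion from an owned String, avoiding a round-trip through &str.
impl From<String> for MyScalar {
    fn from(value: String) -> Self {
        MyScalar::Utf8(Some(value))
    }
}

fn main() {
    assert_eq!(
        MyScalar::from("foo"),
        MyScalar::Utf8(Some("foo".to_string()))
    );
    assert_eq!(
        MyScalar::from("foo".to_string()),
        MyScalar::Utf8(Some("foo".to_string()))
    );
    println!("both conversions succeed");
}
```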






Re: [PR] GH-39050: [C++] Use Cast() instead of CastTo() for Timestamp Scalar in test [arrow]

2023-12-04 Thread via GitHub


kou merged PR #39060:
URL: https://github.com/apache/arrow/pull/39060





Re: [PR] GH-38597: [C++] Implement GetFileInfo(selector) for Azure filesystem [arrow]

2023-12-04 Thread via GitHub


kou commented on code in PR #39009:
URL: https://github.com/apache/arrow/pull/39009#discussion_r1413052434


##
cpp/src/arrow/filesystem/azurefs.cc:
##
@@ -815,6 +815,233 @@ class AzureFileSystem::Impl {
 }
   }
 
+ private:
+  template <typename OnContainer>
+  Status ListContainers(const Azure::Core::Context& context,
+OnContainer&& on_container) const {
+Azure::Storage::Blobs::ListBlobContainersOptions options;
+// Deleted containers are not returned.
+options.Include = 
Azure::Storage::Blobs::Models::ListBlobContainersIncludeFlags::None;
+try {
+  auto container_list_response =
+  blob_service_client_->ListBlobContainers(options, context);
+  for (; container_list_response.HasPage();
+   container_list_response.MoveToNextPage(context)) {
+for (const auto& container : container_list_response.BlobContainers) {
+  RETURN_NOT_OK(on_container(container));
+}
+  }
+} catch (const Azure::Storage::StorageException& exception) {
+  return internal::ExceptionToStatus("Failed to list account containers.", 
exception);
+}
+return Status::OK();
+  }
+
+  static FileInfo FileInfoFromBlob(const std::string& container,
+   const 
Azure::Storage::Blobs::Models::BlobItem& blob) {
+if (blob.Name.back() == internal::kSep) {
+  return DirectoryFileInfoFromPath(container + internal::kSep + blob.Name);
+}
+std::string path;
+path.reserve(container.size() + 1 + blob.Name.size());
+path += container;
+path += internal::kSep;
+path += blob.Name;
+FileInfo info{std::move(path), FileType::File};
+info.set_size(blob.BlobSize);
+
info.set_mtime(std::chrono::system_clock::time_point{blob.Details.LastModified});
+return info;
+  }
+
+  static FileInfo DirectoryFileInfoFromPath(const std::string& path) {
+return FileInfo{std::string{internal::RemoveTrailingSlash(path)},
+FileType::Directory};
+  }
+
+  static std::string_view BasenameView(std::string_view s) {
+auto offset = s.find_last_of(internal::kSep);
+auto tail = (offset == std::string_view::npos) ? s : s.substr(offset);
+return internal::RemoveTrailingSlash(tail, /*preserve_root=*/false);
+  }
+
+  /// \brief List the blobs at the root of a container or some dir in a 
container.
+  ///
+  /// \pre container_client is the client for the container named like the 
first
+  /// segment of select.base_dir.
+  Status GetFileInfoWithSelectorFromContainer(
+  const Azure::Storage::Blobs::BlobContainerClient& container_client,
+  const Azure::Core::Context& context, Azure::Nullable 
page_size_hint,
+  const FileSelector& select, FileInfoVector* acc_results) {
+ARROW_ASSIGN_OR_RAISE(auto base_location, 
AzureLocation::FromString(select.base_dir));
+
+bool found = false;
+Azure::Storage::Blobs::ListBlobsOptions options;
+if (internal::GetAbstractPathDepth(base_location.path) == 0) {
+  // If the base_dir is the root of the container, then we want to list 
all blobs in
+  // the container and the Prefix should be empty and not even include the 
trailing
+  // slash because the container itself represents the `/` 
directory.
+  options.Prefix = {};
+  found = true;  // Unless the container itself is not found later!
+} else {
+  options.Prefix = internal::EnsureTrailingSlash(base_location.path);
+}
+options.PageSizeHint = page_size_hint;
+options.Include = 
Azure::Storage::Blobs::Models::ListBlobsIncludeFlags::Metadata;
+
+// When Prefix.Value() contains a trailing slash and we find a blob that
+// matches it completely, it is an empty directory marker blob for the
+// directory we're listing from, and we should skip it.
+auto is_empty_dir_marker =
+[&options](const Azure::Storage::Blobs::Models::BlobItem& blob) 
noexcept -> bool {
+  return options.Prefix.HasValue() && blob.Name == options.Prefix.Value();
+};
+
+auto recurse = [&](const std::string& blob_prefix) noexcept -> Status {
+  if (select.recursive && select.max_recursion > 0) {
+FileSelector sub_select;
+sub_select.base_dir = base_location.container;
+sub_select.base_dir += internal::kSep;
+sub_select.base_dir += internal::RemoveTrailingSlash(blob_prefix);
+sub_select.allow_not_found = true;
+sub_select.recursive = true;
+sub_select.max_recursion = select.max_recursion - 1;
+return GetFileInfoWithSelectorFromContainer(
+container_client, context, page_size_hint, sub_select, 
acc_results);
+  }
+  return Status::OK();
+};
+
+// (*acc_results)[*last_dir_reported] is the last FileType::Directory in 
the results
+// produced through this loop over the response pages.
+std::optional last_dir_reported{};
+auto matches_last_dir_reported = [&last_dir_reported,
+  acc_results]

Re: [PR] GH-37884: [Swift] allow reading of unaligned FlatBuffers buffers [arrow]

2023-12-04 Thread via GitHub


kou commented on code in PR #38635:
URL: https://github.com/apache/arrow/pull/38635#discussion_r1414754609


##
swift/Arrow/Package.swift:
##
@@ -32,7 +32,7 @@ let package = Package(
 targets: ["Arrow"]),
 ],
 dependencies: [
-.package(url: "https://github.com/google/flatbuffers.git", from: "23.3.3")
+.package(url: "https://github.com/google/flatbuffers.git", branch: "master")

Review Comment:
   Could you add a comment explaining why we need to use `master` for now and when we can specify a tag again?






Re: [PR] GH-38597: [C++] Implement GetFileInfo(selector) for Azure filesystem [arrow]

2023-12-04 Thread via GitHub


felipecrv commented on code in PR #39009:
URL: https://github.com/apache/arrow/pull/39009#discussion_r1414751422


##
cpp/src/arrow/filesystem/azurefs.cc:
##
@@ -815,6 +815,233 @@ class AzureFileSystem::Impl {
 }
   }
 
+ private:
+  template <typename OnContainer>
+  Status ListContainers(const Azure::Core::Context& context,
+OnContainer&& on_container) const {
+Azure::Storage::Blobs::ListBlobContainersOptions options;
+// Deleted containers are not returned.
+options.Include = 
Azure::Storage::Blobs::Models::ListBlobContainersIncludeFlags::None;
+try {
+  auto container_list_response =
+  blob_service_client_->ListBlobContainers(options, context);
+  for (; container_list_response.HasPage();
+   container_list_response.MoveToNextPage(context)) {
+for (const auto& container : container_list_response.BlobContainers) {
+  RETURN_NOT_OK(on_container(container));
+}
+  }
+} catch (const Azure::Storage::StorageException& exception) {
+  return internal::ExceptionToStatus("Failed to list account containers.", 
exception);
+}
+return Status::OK();
+  }
+
+  static FileInfo FileInfoFromBlob(const std::string& container,
+   const 
Azure::Storage::Blobs::Models::BlobItem& blob) {
+if (blob.Name.back() == internal::kSep) {
+  return DirectoryFileInfoFromPath(container + internal::kSep + blob.Name);
+}
+std::string path;
+path.reserve(container.size() + 1 + blob.Name.size());
+path += container;
+path += internal::kSep;
+path += blob.Name;
+FileInfo info{std::move(path), FileType::File};
+info.set_size(blob.BlobSize);
+
info.set_mtime(std::chrono::system_clock::time_point{blob.Details.LastModified});
+return info;
+  }
+
+  static FileInfo DirectoryFileInfoFromPath(const std::string& path) {
+return FileInfo{std::string{internal::RemoveTrailingSlash(path)},
+FileType::Directory};
+  }
+
+  static std::string_view BasenameView(std::string_view s) {
+auto offset = s.find_last_of(internal::kSep);
+auto tail = (offset == std::string_view::npos) ? s : s.substr(offset);
+return internal::RemoveTrailingSlash(tail, /*preserve_root=*/false);
+  }
+
+  /// \brief List the blobs at the root of a container or some dir in a 
container.
+  ///
+  /// \pre container_client is the client for the container named like the 
first
+  /// segment of select.base_dir.
+  Status GetFileInfoWithSelectorFromContainer(
+  const Azure::Storage::Blobs::BlobContainerClient& container_client,
+  const Azure::Core::Context& context, Azure::Nullable 
page_size_hint,
+  const FileSelector& select, FileInfoVector* acc_results) {
+ARROW_ASSIGN_OR_RAISE(auto base_location, 
AzureLocation::FromString(select.base_dir));
+
+bool found = false;
+Azure::Storage::Blobs::ListBlobsOptions options;
+if (internal::GetAbstractPathDepth(base_location.path) == 0) {
+  // If the base_dir is the root of the container, then we want to list 
all blobs in
+  // the container and the Prefix should be empty and not even include the 
trailing
+  // slash because the container itself represents the `/` 
directory.
+  options.Prefix = {};
+  found = true;  // Unless the container itself is not found later!
+} else {
+  options.Prefix = internal::EnsureTrailingSlash(base_location.path);
+}
+options.PageSizeHint = page_size_hint;
+options.Include = 
Azure::Storage::Blobs::Models::ListBlobsIncludeFlags::Metadata;
+
+// When Prefix.Value() contains a trailing slash and we find a blob that
+// matches it completely, it is an empty directory marker blob for the
+// directory we're listing from, and we should skip it.
+auto is_empty_dir_marker =
+[&options](const Azure::Storage::Blobs::Models::BlobItem& blob) 
noexcept -> bool {
+  return options.Prefix.HasValue() && blob.Name == options.Prefix.Value();
+};
+
+auto recurse = [&](const std::string& blob_prefix) noexcept -> Status {
+  if (select.recursive && select.max_recursion > 0) {
+FileSelector sub_select;
+sub_select.base_dir = base_location.container;
+sub_select.base_dir += internal::kSep;
+sub_select.base_dir += internal::RemoveTrailingSlash(blob_prefix);
+sub_select.allow_not_found = true;
+sub_select.recursive = true;
+sub_select.max_recursion = select.max_recursion - 1;
+return GetFileInfoWithSelectorFromContainer(
+container_client, context, page_size_hint, sub_select, 
acc_results);
+  }
+  return Status::OK();
+};
+
+// (*acc_results)[*last_dir_reported] is the last FileType::Directory in 
the results
+// produced through this loop over the response pages.
+std::optional<size_t> last_dir_reported{};
+auto matches_last_dir_reported = [&last_dir_reported,
+  acc_re

Re: [PR] GH-38309: [C++] build filesystems as separate modules [arrow]

2023-12-04 Thread via GitHub


kou commented on code in PR #39067:
URL: https://github.com/apache/arrow/pull/39067#discussion_r1414747502


##
cpp/src/arrow/filesystem/filesystem.cc:
##
@@ -738,6 +761,109 @@ Result<std::shared_ptr<FileSystem>> FileSystemFromUriReal(const Uri& uri,
 
 }  // namespace
 
+Status RegisterFileSystemFactory(std::vector<std::string> schemes,
+ FileSystem::Factory factory) {
+  auto& [mutex, scheme_to_factory] = FileSystemFactoryRegistry();
+  std::unique_lock lock{mutex};
+
+  for (auto&& scheme : schemes) {
+if (auto& ref = scheme_to_factory[scheme]) {
+  return Status::KeyError(
+  "Tried to add a factory for ", scheme,
+  ":// uris, but a factory was already registered for that scheme");
+} else {
+  ref = factory;
+}
+  }
+  return Status::OK();
+}
+
+// XXX alternative approach:
+// If we distrust global constructors more than we distrust
+// symbol visibility, we could export specially named functions
+// and retrieve them via dlsym.
+Status RegisterFileSystemFactoryModule(std::string module_name) {
+  static auto* handle = dlopen(nullptr, RTLD_NOW | RTLD_LOCAL);

Review Comment:
   This will be changed to `dlopen(std::string("libarrow_fs_") + module_name + ".so", ...)` or something similar eventually, right?



##
cpp/src/arrow/filesystem/filesystem.cc:
##
@@ -677,12 +684,28 @@ Status CopyFiles(const std::shared_ptr& 
source_fs,
 
 namespace {
 
+auto& FileSystemFactoryRegistry() {
+  static struct {
+std::recursive_mutex mutex;
+std::unordered_map<std::string, FileSystem::Factory> scheme_to_factory;
+  } registry;
+  return registry;
+}
+
 Result<std::shared_ptr<FileSystem>> FileSystemFromUriReal(const Uri& uri,
   const std::string& 
uri_string,
   const io::IOContext& 
io_context,
   std::string* 
out_path) {
   const auto scheme = uri.scheme();
 
+  {
+auto& [mutex, scheme_to_factory] = FileSystemFactoryRegistry();
+std::unique_lock lock{mutex};

Review Comment:
   Can we use `shared_lock` here?
   It seems that this doesn't change the registry.
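
   A minimal self-contained sketch of the shared-lock read path being suggested, assuming (hypothetically) a scheme-to-factory registry guarded by `std::shared_mutex`: lookups take a `std::shared_lock`, registration takes an exclusive `std::unique_lock`. Names are illustrative, not Arrow's actual API:

   ```cpp
   #include <cassert>
   #include <functional>
   #include <mutex>
   #include <optional>
   #include <shared_mutex>
   #include <string>
   #include <unordered_map>

   // Hypothetical scheme -> factory registry illustrating the shared_lock
   // suggestion: reads take a shared lock, registration an exclusive one.
   using Factory = std::function<int()>;

   struct Registry {
     std::shared_mutex mutex;
     std::unordered_map<std::string, Factory> scheme_to_factory;

     // Exclusive lock: mutates the map. Returns false if the scheme exists.
     bool Register(const std::string& scheme, Factory factory) {
       std::unique_lock lock{mutex};
       return scheme_to_factory.emplace(scheme, std::move(factory)).second;
     }

     // Shared lock: read-only lookup; copies the factory out under the lock.
     std::optional<Factory> Lookup(const std::string& scheme) {
       std::shared_lock lock{mutex};
       auto it = scheme_to_factory.find(scheme);
       if (it == scheme_to_factory.end()) return std::nullopt;
       return it->second;
     }
   };

   int main() {
     Registry registry;
     assert(registry.Register("s3", [] { return 42; }));
     assert(!registry.Register("s3", [] { return 0; }));  // duplicate rejected
     auto factory = registry.Lookup("s3");
     assert(factory.has_value() && (*factory)() == 42);
     assert(!registry.Lookup("gcs").has_value());
     return 0;
   }
   ```

   One caveat with moving away from `std::recursive_mutex`: if a factory callback could re-enter the registry while being invoked, a non-recursive lock like this would deadlock, which may be why the original code uses a recursive mutex.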



##
cpp/src/arrow/filesystem/filesystem.cc:
##
@@ -738,6 +761,109 @@ Result<std::shared_ptr<FileSystem>> FileSystemFromUriReal(const Uri& uri,
 
 }  // namespace
 
+Status RegisterFileSystemFactory(std::vector<std::string> schemes,
+ FileSystem::Factory factory) {
+  auto& [mutex, scheme_to_factory] = FileSystemFactoryRegistry();
+  std::unique_lock lock{mutex};
+
+  for (auto&& scheme : schemes) {
+if (auto& ref = scheme_to_factory[scheme]) {
+  return Status::KeyError(
+  "Tried to add a factory for ", scheme,
+  ":// uris, but a factory was already registered for that scheme");
+} else {
+  ref = factory;
+}
+  }
+  return Status::OK();
+}
+
+// XXX alternative approach:

Review Comment:
   I like this approach.
   Because hiding `dlopen()`/`LoadLibrary()` details from users is easy to use.
   (Users need to call `dlopen()`/`LoadLibrary()` explicitly with the 
`Registrar` approach, right?)
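
   For illustration, a hedged sketch of the dlsym-based alternative under discussion: load a module and resolve a specially named registration entry point. The symbol name (`arrow_fs_register`) and module naming are made up for this example, not an actual Arrow symbol:

   ```cpp
   #include <cassert>
   #include <string>

   #include <dlfcn.h>  // POSIX dynamic loading: dlopen/dlsym

   // Hypothetical loader: resolve a well-known registration function from a
   // dynamically loaded filesystem module and invoke it.
   using RegisterFn = void (*)();

   bool LoadFileSystemModule(const std::string& module_path) {
     // An empty path opens the calling program itself; a real loader would
     // compute something like "libarrow_fs_" + module_name + ".so".
     void* handle = dlopen(module_path.empty() ? nullptr : module_path.c_str(),
                           RTLD_NOW | RTLD_LOCAL);
     if (handle == nullptr) return false;
     auto fn = reinterpret_cast<RegisterFn>(dlsym(handle, "arrow_fs_register"));
     if (fn == nullptr) return false;  // module lacks the entry point
     fn();
     return true;
   }

   int main() {
     // This binary does not export "arrow_fs_register", so the lookup fails
     // gracefully rather than crashing.
     assert(!LoadFileSystemModule(""));
     // A nonexistent module fails at dlopen().
     assert(!LoadFileSystemModule("/no/such/libarrow_fs_fake.so"));
     return 0;
   }
   ```

   The appeal kou notes is that a wrapper like this lets users name a module instead of calling `dlopen()`/`LoadLibrary()` themselves. (On older glibc, linking requires `-ldl`.)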



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] GH-38597: [C++] Implement GetFileInfo(selector) for Azure filesystem [arrow]

2023-12-04 Thread via GitHub


felipecrv commented on code in PR #39009:
URL: https://github.com/apache/arrow/pull/39009#discussion_r1414744853


##
cpp/src/arrow/filesystem/azurefs.cc:
##
@@ -815,6 +815,233 @@ class AzureFileSystem::Impl {
 }
   }
 
+ private:
+  template <typename OnContainer>
+  Status ListContainers(const Azure::Core::Context& context,
+OnContainer&& on_container) const {
+Azure::Storage::Blobs::ListBlobContainersOptions options;
+// Deleted containers are not returned.
+options.Include = 
Azure::Storage::Blobs::Models::ListBlobContainersIncludeFlags::None;
+try {
+  auto container_list_response =
+  blob_service_client_->ListBlobContainers(options, context);
+  for (; container_list_response.HasPage();
+   container_list_response.MoveToNextPage(context)) {
+for (const auto& container : container_list_response.BlobContainers) {
+  RETURN_NOT_OK(on_container(container));
+}
+  }
+} catch (const Azure::Storage::StorageException& exception) {
+  return internal::ExceptionToStatus("Failed to list account containers.", 
exception);
+}
+return Status::OK();
+  }
+
+  static FileInfo FileInfoFromBlob(const std::string& container,
+   const 
Azure::Storage::Blobs::Models::BlobItem& blob) {
+if (blob.Name.back() == internal::kSep) {
+  return DirectoryFileInfoFromPath(container + internal::kSep + blob.Name);
+}
+std::string path;
+path.reserve(container.size() + 1 + blob.Name.size());
+path += container;
+path += internal::kSep;
+path += blob.Name;
+FileInfo info{std::move(path), FileType::File};
+info.set_size(blob.BlobSize);
+
info.set_mtime(std::chrono::system_clock::time_point{blob.Details.LastModified});
+return info;
+  }
+
+  static FileInfo DirectoryFileInfoFromPath(const std::string& path) {
+return FileInfo{std::string{internal::RemoveTrailingSlash(path)},
+FileType::Directory};
+  }
+
+  static std::string_view BasenameView(std::string_view s) {
+auto offset = s.find_last_of(internal::kSep);
+auto tail = (offset == std::string_view::npos) ? s : s.substr(offset);
+return internal::RemoveTrailingSlash(tail, /*preserve_root=*/false);
+  }
+
+  /// \brief List the blobs at the root of a container or some dir in a 
container.
+  ///
+  /// \pre container_client is the client for the container named like the 
first
+  /// segment of select.base_dir.
+  Status GetFileInfoWithSelectorFromContainer(
+  const Azure::Storage::Blobs::BlobContainerClient& container_client,
+  const Azure::Core::Context& context, Azure::Nullable<int32_t> page_size_hint,
+  const FileSelector& select, FileInfoVector* acc_results) {
+ARROW_ASSIGN_OR_RAISE(auto base_location, 
AzureLocation::FromString(select.base_dir));
+
+bool found = false;
+Azure::Storage::Blobs::ListBlobsOptions options;
+if (internal::GetAbstractPathDepth(base_location.path) == 0) {
+  // If the base_dir is the root of the container, then we want to list 
all blobs in
+  // the container and the Prefix should be empty and not even include the 
trailing
+  // slash because the container itself represents the `/` 
directory.
+  options.Prefix = {};
+  found = true;  // Unless the container itself is not found later!
+} else {
+  options.Prefix = internal::EnsureTrailingSlash(base_location.path);
+}
+options.PageSizeHint = page_size_hint;
+options.Include = 
Azure::Storage::Blobs::Models::ListBlobsIncludeFlags::Metadata;
+
+// When Prefix.Value() contains a trailing slash and we find a blob that
+// matches it completely, it is an empty directory marker blob for the
+// directory we're listing from, and we should skip it.
+auto is_empty_dir_marker =
+[&options](const Azure::Storage::Blobs::Models::BlobItem& blob) 
noexcept -> bool {
+  return options.Prefix.HasValue() && blob.Name == options.Prefix.Value();
+};
+
+auto recurse = [&](const std::string& blob_prefix) noexcept -> Status {
+  if (select.recursive && select.max_recursion > 0) {
+FileSelector sub_select;
+sub_select.base_dir = base_location.container;
+sub_select.base_dir += internal::kSep;
+sub_select.base_dir += internal::RemoveTrailingSlash(blob_prefix);
+sub_select.allow_not_found = true;
+sub_select.recursive = true;
+sub_select.max_recursion = select.max_recursion - 1;
+return GetFileInfoWithSelectorFromContainer(
+container_client, context, page_size_hint, sub_select, 
acc_results);
+  }
+  return Status::OK();
+};
+
+// (*acc_results)[*last_dir_reported] is the last FileType::Directory in 
the results
+// produced through this loop over the response pages.
+std::optional<size_t> last_dir_reported{};

Review Comment:
   This is needed because the Azure server might break the page after a "dir/", 
before

Re: [PR] GH-38597: [C++] Implement GetFileInfo(selector) for Azure filesystem [arrow]

2023-12-04 Thread via GitHub


felipecrv commented on PR #39009:
URL: https://github.com/apache/arrow/pull/39009#issuecomment-1839860421

   I rebased to fix a conflict, but the new changes are on the last two commits.





Re: [I] DataFusion weekly project plan (Andrew Lamb) - Nov 27, 2023 [arrow-datafusion]

2023-12-04 Thread via GitHub


alamb closed issue #8329: DataFusion weekly project plan (Andrew Lamb) - Nov 
27, 2023
URL: https://github.com/apache/arrow-datafusion/issues/8329





Re: [I] DataFusion weekly project plan (Andrew Lamb) - Nov 27, 2023 [arrow-datafusion]

2023-12-04 Thread via GitHub


alamb commented on issue #8329:
URL: 
https://github.com/apache/arrow-datafusion/issues/8329#issuecomment-1839853479

   Next Week: https://github.com/apache/arrow-datafusion/issues/8420





Re: [I] DataFusion weekly project plan (Andrew Lamb) - Nov 27, 2023 [arrow-datafusion]

2023-12-04 Thread via GitHub


alamb commented on issue #8329:
URL: 
https://github.com/apache/arrow-datafusion/issues/8329#issuecomment-1839853226

   > Thanks @alamb !
   > 
   > Many people have asked me how to get involved in the DataFusion community 
and what they can do for the project (Because I've previously promoted 
DataFusion on the Chinese social media). This summary is very helpful for them.
   
   Thank you for this feedback. It is very helpful. Other than writing up more 
detailed tickets, do you know of other things that would help?
   
   > I have pinned this issue to make it easy for everyone to view.
   
   I saw this -- that was very cool. Thank you. I will pin future such issues. 
   
   > Express gratitude once again for all that you have done for the community.
   
   Thank you very much @jackwener  - I am always amazed by all the work we do 
together. 





[I] DataFusion weekly project plan (Andrew Lamb) - Dec 4, 2023 [arrow-datafusion]

2023-12-04 Thread via GitHub


alamb opened a new issue, #8420:
URL: https://github.com/apache/arrow-datafusion/issues/8420

   Follow on to [Nov 27 
2023](https://github.com/apache/arrow-datafusion/issues/8329):
   
   ## Boilerplate Overview
   The idea of this ticket is to make my plans for DataFusion visible, largely for 
my own personal organizational needs, but also to:
   1. Try some different ways to communicate / coordinate in the community 
   2. Help provide an interesting summary of what is happening in DataFusion 
this week
   
   It would be great if anyone else who has plans like this for DataFusion 
could try to make them visible somehow as well
   
   ## My (personal) plans for this week
   - [ ] Complete https://github.com/apache/arrow-datafusion/issues/8376
   - [ ] Reproduce and file several bugs / limitations we have hit in Influx 
recently
   
   ## Project Queue (list of future projects)
   - [ ] https://github.com/apache/arrow-datafusion/issues/8227 (specifically, 
https://github.com/apache/arrow-datafusion/issues/8229 as a step towards 
https://github.com/apache/arrow-datafusion/issues/8078)
   - [ ] https://github.com/apache/arrow-datafusion/issues/8130, in support of 
memory explosion in HashJoin outputs: 
https://github.com/apache/arrow-datafusion/issues/7848 with @korowa 
   
   
   ## Additional major projects I hope to help review and coordinate
   - [ ] Next steps towards making function packages more modular (make all 
BuiltInScalarFunctions ScalarUDFs): 
https://github.com/apache/arrow-datafusion/issues/8045 with @2010YOUY01 . This 
week I hope to get https://github.com/apache/arrow-datafusion/pull/8046 
mergable, which would require 
https://github.com/apache/arrow-datafusion/issues/8157)
   - [ ] `ARRAY` function hardening:  
https://github.com/apache/arrow-datafusion/issues/6980 / 
https://github.com/apache/arrow-datafusion/issues/7988 
   - [ ] Review any new ticket / bug reports 
   
   # Algorithm for (my) prioritizing PR reviews
   Note there are many [committers](https://arrow.apache.org/committers/) who 
can review and merge PRs as well, so this is not the priorities of the project 
as a whole, just the approximate algorithm I am using
   
   Priority:
   1. Bug fixes (where something is just incorrect), especially regressions 
(where it used to work and now does not)
   2. Improvements directly related to features needed for InfluxDB (as I am 
employed by them)
   3. Documentation and test improvements (I view these as very strategically 
important)
   4. PRs that I think are strategically important
   5. Other new features / additions to functionality
   
   The current strategically important projects in my head are:
   
   * https://github.com/apache/arrow-datafusion/issues/8045
   * Improving performance of high cardinality aggregation
   
   Note that adding new array functionality (e.g. 
https://github.com/apache/arrow-datafusion/issues/6980 and its children) is 
very low on my priority list, not because I don't think they are valuable, but 
because I think they are less valuable than the other projects going on - it 
would be great if other people could help pick these up too. 
   
   Thus, if you are interested in contributing to DataFusion and are interested 
in a fast turn around time I would recommend looking into bug fixes / test 
improvements / documentation / etc. 
   
   If you propose adding new functionality, the review cycle will likely be 
longer. You can make it a shorter cycle by looking at the comments on other 
recent PRs and following the same model (e.g., ensure there are tests in 
sqllogictest)





Re: [PR] GH-15060: [JS] Add LargeUtf8 type [arrow]

2023-12-04 Thread via GitHub


domoritz commented on PR #35780:
URL: https://github.com/apache/arrow/pull/35780#issuecomment-1839849622

   Hmm, then we can't really automatically flush in the vector builder. That's 
a bummer. Maybe then I need to implement a chunked buffer that can satisfy the 
`DataBuffer` type. I need to look at what implications that may have. 
   
   Maybe @trxcllnt has some suggestion here. 





Re: [PR] MINOR: [CI] Bump actions/setup-java from 3 to 4 [arrow]

2023-12-04 Thread via GitHub


kou merged PR #39063:
URL: https://github.com/apache/arrow/pull/39063





Re: [PR] MINOR: [CI] Bump actions/labeler from 4.3.0 to 5.0.0 [arrow]

2023-12-04 Thread via GitHub


kou commented on PR #39062:
URL: https://github.com/apache/arrow/pull/39062#issuecomment-1839847476

   Hmm. It seems that we need to change our configuration files for this, something like the following:
   
   ```diff
   diff --git a/.github/workflows/dev_pr/labeler.yml 
b/.github/workflows/dev_pr/labeler.yml
   index c31d1f309..bb4d261f8 100644
   --- a/.github/workflows/dev_pr/labeler.yml
   +++ b/.github/workflows/dev_pr/labeler.yml
   @@ -16,7 +16,9 @@
# under the License.

"Component: C++":
   -  - cpp/**/*
   +  - any:
   +- any-glob-to-any-file:
   +  - cpp/**/*

"Component: GLib":
  - c_glib/**/*
   ```





Re: [PR] GH-38597: [C++] Implement GetFileInfo(selector) for Azure filesystem [arrow]

2023-12-04 Thread via GitHub


felipecrv commented on code in PR #39009:
URL: https://github.com/apache/arrow/pull/39009#discussion_r1414734379


##
cpp/src/arrow/filesystem/azurefs.cc:
##
@@ -815,6 +815,233 @@ class AzureFileSystem::Impl {
 }
   }
 
+ private:
+  template <typename OnContainer>
+  Status ListContainers(const Azure::Core::Context& context,
+OnContainer&& on_container) const {
+Azure::Storage::Blobs::ListBlobContainersOptions options;
+// Deleted containers are not returned.
+options.Include = 
Azure::Storage::Blobs::Models::ListBlobContainersIncludeFlags::None;
+try {
+  auto container_list_response =
+  blob_service_client_->ListBlobContainers(options, context);
+  for (; container_list_response.HasPage();
+   container_list_response.MoveToNextPage(context)) {
+for (const auto& container : container_list_response.BlobContainers) {
+  RETURN_NOT_OK(on_container(container));
+}
+  }
+} catch (const Azure::Storage::StorageException& exception) {
+  return internal::ExceptionToStatus("Failed to list account containers.", 
exception);
+}
+return Status::OK();
+  }
+
+  static FileInfo FileInfoFromBlob(const std::string& container,
+   const 
Azure::Storage::Blobs::Models::BlobItem& blob) {
+if (blob.Name.back() == internal::kSep) {
+  return DirectoryFileInfoFromPath(container + internal::kSep + blob.Name);
+}
+std::string path;
+path.reserve(container.size() + 1 + blob.Name.size());
+path += container;
+path += internal::kSep;
+path += blob.Name;
+FileInfo info{std::move(path), FileType::File};
+info.set_size(blob.BlobSize);
+
info.set_mtime(std::chrono::system_clock::time_point{blob.Details.LastModified});
+return info;
+  }
+
+  static FileInfo DirectoryFileInfoFromPath(const std::string& path) {
+return FileInfo{std::string{internal::RemoveTrailingSlash(path)},
+FileType::Directory};
+  }
+
+  static std::string_view BasenameView(std::string_view s) {
+auto offset = s.find_last_of(internal::kSep);
+auto tail = (offset == std::string_view::npos) ? s : s.substr(offset);
+return internal::RemoveTrailingSlash(tail, /*preserve_root=*/false);
+  }
+
+  /// \brief List the blobs at the root of a container or some dir in a 
container.
+  ///
+  /// \pre container_client is the client for the container named like the 
first
+  /// segment of select.base_dir.
+  Status GetFileInfoWithSelectorFromContainer(
+  const Azure::Storage::Blobs::BlobContainerClient& container_client,
+  const Azure::Core::Context& context, Azure::Nullable<int32_t> page_size_hint,
+  const FileSelector& select, FileInfoVector* acc_results) {
+ARROW_ASSIGN_OR_RAISE(auto base_location, 
AzureLocation::FromString(select.base_dir));
+
+bool found = false;
+Azure::Storage::Blobs::ListBlobsOptions options;
+if (internal::GetAbstractPathDepth(base_location.path) == 0) {
+  // If the base_dir is the root of the container, then we want to list 
all blobs in
+  // the container and the Prefix should be empty and not even include the 
trailing
+  // slash because the container itself represents the `/` 
directory.
+  options.Prefix = {};
+  found = true;  // Unless the container itself is not found later!
+} else {
+  options.Prefix = internal::EnsureTrailingSlash(base_location.path);
+}
+options.PageSizeHint = page_size_hint;
+options.Include = 
Azure::Storage::Blobs::Models::ListBlobsIncludeFlags::Metadata;

Review Comment:
   Something I prefer to do once all PRs land because it would be a noisy 
change.






Re: [PR] GH-38597: [C++] Implement GetFileInfo(selector) for Azure filesystem [arrow]

2023-12-04 Thread via GitHub


felipecrv commented on code in PR #39009:
URL: https://github.com/apache/arrow/pull/39009#discussion_r1414734002


##
cpp/src/arrow/filesystem/azurefs.cc:
##
@@ -815,6 +815,233 @@ class AzureFileSystem::Impl {
 }
   }
 
+ private:
+  template <typename OnContainer>
+  Status ListContainers(const Azure::Core::Context& context,
+OnContainer&& on_container) const {
+Azure::Storage::Blobs::ListBlobContainersOptions options;
+// Deleted containers are not returned.
+options.Include = 
Azure::Storage::Blobs::Models::ListBlobContainersIncludeFlags::None;
+try {
+  auto container_list_response =
+  blob_service_client_->ListBlobContainers(options, context);
+  for (; container_list_response.HasPage();
+   container_list_response.MoveToNextPage(context)) {
+for (const auto& container : container_list_response.BlobContainers) {
+  RETURN_NOT_OK(on_container(container));
+}
+  }
+} catch (const Azure::Storage::StorageException& exception) {
+  return internal::ExceptionToStatus("Failed to list account containers.", 
exception);
+}
+return Status::OK();
+  }
+
+  static FileInfo FileInfoFromBlob(const std::string& container,
+   const 
Azure::Storage::Blobs::Models::BlobItem& blob) {
+if (blob.Name.back() == internal::kSep) {
+  return DirectoryFileInfoFromPath(container + internal::kSep + blob.Name);

Review Comment:
   Code evolves in mysterious ways when it's late into the workday 🤡 
   
   It's gonna be `ConcatAbstractPath` everywhere now.
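
   As a rough illustration of what a helper in the spirit of `ConcatAbstractPath` does (this is a hypothetical stand-in, not Arrow's actual implementation): join two abstract path segments with exactly one `/`, regardless of stray separators on either side:

   ```cpp
   #include <cassert>
   #include <string>
   #include <string_view>

   // Hypothetical ConcatAbstractPath-style helper: joins two path segments
   // with a single '/' separator, trimming redundant separators first.
   constexpr char kSep = '/';

   std::string ConcatPath(std::string_view base, std::string_view tail) {
     while (!base.empty() && base.back() == kSep) base.remove_suffix(1);
     while (!tail.empty() && tail.front() == kSep) tail.remove_prefix(1);
     std::string out;
     out.reserve(base.size() + 1 + tail.size());
     out += base;
     out += kSep;
     out += tail;
     return out;
   }

   int main() {
     assert(ConcatPath("container", "blob") == "container/blob");
     assert(ConcatPath("container/", "/dir/blob") == "container/dir/blob");
     return 0;
   }
   ```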






Re: [PR] Minor: Improve `PruningPredicate` documentation [arrow-datafusion]

2023-12-04 Thread via GitHub


alamb merged PR #8394:
URL: https://github.com/apache/arrow-datafusion/pull/8394





Re: [PR] Minor: Add example with parameters to LogicalPlan [arrow-datafusion]

2023-12-04 Thread via GitHub


alamb merged PR #8418:
URL: https://github.com/apache/arrow-datafusion/pull/8418





Re: [PR] GH-15060: [JS] Add LargeUtf8 type [arrow]

2023-12-04 Thread via GitHub


kylebarron commented on PR #35780:
URL: https://github.com/apache/arrow/pull/35780#issuecomment-1839842156

   But then you call
   
   ```ts
   table.batches[0].numRows
   // 1
   ```
   
   Implicitly, a batch has a constant number of rows across all `Data` 
instances.
   
   When I test it in Python, surprisingly the `Table` constructor _allows_ it, 
but then when you call `to_batches` it reaggregates chunks into consistent 
record batches:
   
   ```py
   import pyarrow as pa
   
   d1 = [
   pa.array([1, 2], pa.int32()),
   pa.array([3], pa.int32()),
   ]
   d2 = [
   pa.array([4], pa.int32()),
   pa.array([5, 6], pa.int32()),
   ]
   
   c1 = pa.chunked_array(d1)
   c2 = pa.chunked_array(d2)
   table = pa.table({"v1": c1, "v2": c2})
   table
   # pyarrow.Table
   # v1: int32
   # v2: int32
   # 
   # v1: [[1,2],[3]]
   # v2: [[4],[5,6]]
   
   # to_batches regroups into batches of length 1
   table.to_batches()
   # [pyarrow.RecordBatch
   #  v1: int32
   #  v2: int32
   #  
   #  v1: [1]
   #  v2: [4],
   #  pyarrow.RecordBatch
   #  v1: int32
   #  v2: int32
   #  
   #  v1: [2]
   #  v2: [5],
   #  pyarrow.RecordBatch
   #  v1: int32
   #  v2: int32
   #  
   #  v1: [3]
   #  v2: [6]]
   ```
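
   The regrouping `to_batches()` performs can be sketched independently of pyarrow: take the union of every column's cumulative chunk boundaries and slice all columns at those common offsets. For the chunk layouts above ({2, 1} and {1, 2}) the common boundaries are 1, 2, 3, which yields the three single-row batches shown:

   ```cpp
   #include <cassert>
   #include <set>
   #include <vector>

   // Sketch (not pyarrow's implementation) of regrouping misaligned chunked
   // columns into batches with a consistent row count: collect the union of
   // every column's cumulative chunk boundaries, then emit one batch per gap.
   std::vector<int> CommonBatchLengths(
       const std::vector<std::vector<int>>& chunk_lengths_per_column) {
     std::set<int> boundaries;  // ordered, de-duplicated end offsets
     for (const auto& lengths : chunk_lengths_per_column) {
       int offset = 0;
       for (int len : lengths) {
         offset += len;
         boundaries.insert(offset);
       }
     }
     std::vector<int> batch_lengths;
     int prev = 0;
     for (int b : boundaries) {
       batch_lengths.push_back(b - prev);
       prev = b;
     }
     return batch_lengths;
   }

   int main() {
     // v1 chunks [1,2],[3] -> lengths {2,1}; v2 chunks [4],[5,6] -> {1,2}.
     auto lengths = CommonBatchLengths({{2, 1}, {1, 2}});
     assert((lengths == std::vector<int>{1, 1, 1}));  // three single-row batches
     return 0;
   }
   ```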
   





Re: [PR] GH-38597: [C++] Implement GetFileInfo(selector) for Azure filesystem [arrow]

2023-12-04 Thread via GitHub


felipecrv commented on code in PR #39009:
URL: https://github.com/apache/arrow/pull/39009#discussion_r1414731605


##
cpp/src/arrow/filesystem/azurefs.cc:
##
@@ -815,6 +815,233 @@ class AzureFileSystem::Impl {
 }
   }
 
+ private:
+  template <typename OnContainer>
+  Status ListContainers(const Azure::Core::Context& context,
+OnContainer&& on_container) const {
+Azure::Storage::Blobs::ListBlobContainersOptions options;
+// Deleted containers are not returned.
+options.Include = 
Azure::Storage::Blobs::Models::ListBlobContainersIncludeFlags::None;
+try {
+  auto container_list_response =
+  blob_service_client_->ListBlobContainers(options, context);
+  for (; container_list_response.HasPage();
+   container_list_response.MoveToNextPage(context)) {
+for (const auto& container : container_list_response.BlobContainers) {
+  RETURN_NOT_OK(on_container(container));
+}
+  }
+} catch (const Azure::Storage::StorageException& exception) {
+  return internal::ExceptionToStatus("Failed to list account containers.", 
exception);
+}
+return Status::OK();
+  }
+
+  static FileInfo FileInfoFromBlob(const std::string& container,
+   const 
Azure::Storage::Blobs::Models::BlobItem& blob) {
+if (blob.Name.back() == internal::kSep) {
+  return DirectoryFileInfoFromPath(container + internal::kSep + blob.Name);
+}
+std::string path;
+path.reserve(container.size() + 1 + blob.Name.size());
+path += container;
+path += internal::kSep;
+path += blob.Name;
+FileInfo info{std::move(path), FileType::File};
+info.set_size(blob.BlobSize);
+
info.set_mtime(std::chrono::system_clock::time_point{blob.Details.LastModified});
+return info;
+  }
+
+  static FileInfo DirectoryFileInfoFromPath(const std::string& path) {
+return FileInfo{std::string{internal::RemoveTrailingSlash(path)},
+FileType::Directory};
+  }
+
+  static std::string_view BasenameView(std::string_view s) {
+auto offset = s.find_last_of(internal::kSep);
+auto tail = (offset == std::string_view::npos) ? s : s.substr(offset);
+return internal::RemoveTrailingSlash(tail, /*preserve_root=*/false);
+  }
+
+  /// \brief List the blobs at the root of a container or some dir in a 
container.
+  ///
+  /// \pre container_client is the client for the container named like the 
first
+  /// segment of select.base_dir.
+  Status GetFileInfoWithSelectorFromContainer(
+  const Azure::Storage::Blobs::BlobContainerClient& container_client,
+  const Azure::Core::Context& context, Azure::Nullable<int32_t> page_size_hint,
+  const FileSelector& select, FileInfoVector* acc_results) {
+ARROW_ASSIGN_OR_RAISE(auto base_location, 
AzureLocation::FromString(select.base_dir));
+
+bool found = false;
+Azure::Storage::Blobs::ListBlobsOptions options;
+if (internal::GetAbstractPathDepth(base_location.path) == 0) {
+  // If the base_dir is the root of the container, then we want to list 
all blobs in
+  // the container and the Prefix should be empty and not even include the 
trailing
+  // slash because the container itself represents the `/` 
directory.
+  options.Prefix = {};
+  found = true;  // Unless the container itself is not found later!
+} else {
+  options.Prefix = internal::EnsureTrailingSlash(base_location.path);
+}
+options.PageSizeHint = page_size_hint;
+options.Include = 
Azure::Storage::Blobs::Models::ListBlobsIncludeFlags::Metadata;
+
+// When Prefix.Value() contains a trailing slash and we find a blob that
+// matches it completely, it is an empty directory marker blob for the
+// directory we're listing from, and we should skip it.
+auto is_empty_dir_marker =
+[&options](const Azure::Storage::Blobs::Models::BlobItem& blob) 
noexcept -> bool {
+  return options.Prefix.HasValue() && blob.Name == options.Prefix.Value();
+};
+
+auto recurse = [&](const std::string& blob_prefix) noexcept -> Status {
+  if (select.recursive && select.max_recursion > 0) {
+FileSelector sub_select;
+sub_select.base_dir = base_location.container;
+sub_select.base_dir += internal::kSep;
+sub_select.base_dir += internal::RemoveTrailingSlash(blob_prefix);
+sub_select.allow_not_found = true;
+sub_select.recursive = true;
+sub_select.max_recursion = select.max_recursion - 1;
+return GetFileInfoWithSelectorFromContainer(
+container_client, context, page_size_hint, sub_select, 
acc_results);
+  }
+  return Status::OK();
+};
+
+// (*acc_results)[*last_dir_reported] is the last FileType::Directory in 
the results
+// produced through this loop over the response pages.
+std::optional last_dir_reported{};
+auto matches_last_dir_reported = [&last_dir_reported,
+  acc_re

Re: [PR] GH-38597: [C++] Implement GetFileInfo(selector) for Azure filesystem [arrow]

2023-12-04 Thread via GitHub


felipecrv commented on code in PR #39009:
URL: https://github.com/apache/arrow/pull/39009#discussion_r1414730259


##
cpp/src/arrow/filesystem/azurefs.cc:
##
@@ -815,6 +815,233 @@ class AzureFileSystem::Impl {
 }
   }
 
+ private:
+  template <typename OnContainer>
+  Status ListContainers(const Azure::Core::Context& context,
+OnContainer&& on_container) const {
+Azure::Storage::Blobs::ListBlobContainersOptions options;
+// Deleted containers are not returned.
+options.Include = Azure::Storage::Blobs::Models::ListBlobContainersIncludeFlags::None;
+try {
+  auto container_list_response =
+  blob_service_client_->ListBlobContainers(options, context);
+  for (; container_list_response.HasPage();
+   container_list_response.MoveToNextPage(context)) {
+for (const auto& container : container_list_response.BlobContainers) {
+  RETURN_NOT_OK(on_container(container));
+}
+  }
+} catch (const Azure::Storage::StorageException& exception) {
+  return internal::ExceptionToStatus("Failed to list account containers.", exception);
+}
+return Status::OK();
+  }
+
+  static FileInfo FileInfoFromBlob(const std::string& container,
+   const Azure::Storage::Blobs::Models::BlobItem& blob) {
+if (blob.Name.back() == internal::kSep) {
+  return DirectoryFileInfoFromPath(container + internal::kSep + blob.Name);
+}
+std::string path;
+path.reserve(container.size() + 1 + blob.Name.size());
+path += container;
+path += internal::kSep;
+path += blob.Name;
+FileInfo info{std::move(path), FileType::File};
+info.set_size(blob.BlobSize);
+info.set_mtime(std::chrono::system_clock::time_point{blob.Details.LastModified});
+return info;
+  }
+
+  static FileInfo DirectoryFileInfoFromPath(const std::string& path) {
+return FileInfo{std::string{internal::RemoveTrailingSlash(path)},
+FileType::Directory};
+  }
+
+  static std::string_view BasenameView(std::string_view s) {
+auto offset = s.find_last_of(internal::kSep);
+auto tail = (offset == std::string_view::npos) ? s : s.substr(offset);
+return internal::RemoveTrailingSlash(tail, /*preserve_root=*/false);
+  }
+
+  /// \brief List the blobs at the root of a container or some dir in a container.
+  ///
+  /// \pre container_client is the client for the container named like the first
+  /// segment of select.base_dir.
+  Status GetFileInfoWithSelectorFromContainer(
+  const Azure::Storage::Blobs::BlobContainerClient& container_client,
+  const Azure::Core::Context& context, Azure::Nullable<int32_t> page_size_hint,
+  const FileSelector& select, FileInfoVector* acc_results) {
+ARROW_ASSIGN_OR_RAISE(auto base_location, AzureLocation::FromString(select.base_dir));
+
+bool found = false;
+Azure::Storage::Blobs::ListBlobsOptions options;
+if (internal::GetAbstractPathDepth(base_location.path) == 0) {
+  // If the base_dir is the root of the container, then we want to list all blobs in
+  // the container and the Prefix should be empty and not even include the trailing
+  // slash because the container itself represents the `/` directory.
+  options.Prefix = {};
+  found = true;  // Unless the container itself is not found later!
+} else {
+  options.Prefix = internal::EnsureTrailingSlash(base_location.path);
+}
+options.PageSizeHint = page_size_hint;
+options.Include = Azure::Storage::Blobs::Models::ListBlobsIncludeFlags::Metadata;
+
+// When Prefix.Value() contains a trailing slash and we find a blob that
+// matches it completely, it is an empty directory marker blob for the
+// directory we're listing from, and we should skip it.
+auto is_empty_dir_marker =
+[&options](const Azure::Storage::Blobs::Models::BlobItem& blob) noexcept -> bool {
+  return options.Prefix.HasValue() && blob.Name == options.Prefix.Value();
+};
+
+auto recurse = [&](const std::string& blob_prefix) noexcept -> Status {
+  if (select.recursive && select.max_recursion > 0) {
+FileSelector sub_select;
+sub_select.base_dir = base_location.container;
+sub_select.base_dir += internal::kSep;
+sub_select.base_dir += internal::RemoveTrailingSlash(blob_prefix);
+sub_select.allow_not_found = true;
+sub_select.recursive = true;
+sub_select.max_recursion = select.max_recursion - 1;
+return GetFileInfoWithSelectorFromContainer(
+container_client, context, page_size_hint, sub_select, acc_results);
+  }
+  return Status::OK();
+};
+
+// (*acc_results)[*last_dir_reported] is the last FileType::Directory in the results
+// produced through this loop over the response pages.
+std::optional<size_t> last_dir_reported{};
+auto matches_last_dir_reported = [&last_dir_reported,
+  acc_re

Re: [PR] GH-38597: [C++] Implement GetFileInfo(selector) for Azure filesystem [arrow]

2023-12-04 Thread via GitHub


felipecrv commented on code in PR #39009:
URL: https://github.com/apache/arrow/pull/39009#discussion_r1414723747


##
cpp/src/arrow/filesystem/azurefs.cc:
##
@@ -815,6 +815,233 @@ class AzureFileSystem::Impl {
 }
   }
 
+ private:
+  template <typename OnContainer>
+  Status ListContainers(const Azure::Core::Context& context,

Review Comment:
   I will change it to `VisitContainers`. `ListContainers` is indeed a bad name.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] GH-38597: [C++] Implement GetFileInfo(selector) for Azure filesystem [arrow]

2023-12-04 Thread via GitHub


felipecrv commented on code in PR #39009:
URL: https://github.com/apache/arrow/pull/39009#discussion_r1414722781


##
cpp/src/arrow/filesystem/azurefs.cc:
##
@@ -1103,7 +1330,12 @@ Result<FileInfo> AzureFileSystem::GetFileInfo(const std::string& path) {
 }
 
Result<FileInfoVector> AzureFileSystem::GetFileInfo(const FileSelector& select) {
-  return Status::NotImplemented("The Azure FileSystem is not fully implemented");
+  Azure::Core::Context context;
+  Azure::Nullable<int32_t> page_size_hint;  // unspecified
+  FileInfoVector results;
+  RETURN_NOT_OK(
+  impl_->GetFileInfoWithSelector(context, page_size_hint, select, &results));

Review Comment:
   Since `Impl` doesn't necessarily have to provide overloaded methods like the 
public interface does I opted for a clearer name.






Re: [PR] GH-38597: [C++] Implement GetFileInfo(selector) for Azure filesystem [arrow]

2023-12-04 Thread via GitHub


felipecrv commented on code in PR #39009:
URL: https://github.com/apache/arrow/pull/39009#discussion_r1414721290


##
cpp/src/arrow/filesystem/azurefs.cc:
##
@@ -815,6 +815,233 @@ class AzureFileSystem::Impl {
 }
   }
 
+ private:
+  template <typename OnContainer>
+  Status ListContainers(const Azure::Core::Context& context,
+OnContainer&& on_container) const {
+Azure::Storage::Blobs::ListBlobContainersOptions options;
+// Deleted containers are not returned.
+options.Include = Azure::Storage::Blobs::Models::ListBlobContainersIncludeFlags::None;
+try {
+  auto container_list_response =
+  blob_service_client_->ListBlobContainers(options, context);
+  for (; container_list_response.HasPage();
+   container_list_response.MoveToNextPage(context)) {
+for (const auto& container : container_list_response.BlobContainers) {
+  RETURN_NOT_OK(on_container(container));
+}
+  }
+} catch (const Azure::Storage::StorageException& exception) {
+  return internal::ExceptionToStatus("Failed to list account containers.", exception);
+}
+return Status::OK();
+  }
+
+  static FileInfo FileInfoFromBlob(const std::string& container,
+   const Azure::Storage::Blobs::Models::BlobItem& blob) {
+if (blob.Name.back() == internal::kSep) {
+  return DirectoryFileInfoFromPath(container + internal::kSep + blob.Name);
+}
+std::string path;
+path.reserve(container.size() + 1 + blob.Name.size());
+path += container;
+path += internal::kSep;
+path += blob.Name;
+FileInfo info{std::move(path), FileType::File};
+info.set_size(blob.BlobSize);
+info.set_mtime(std::chrono::system_clock::time_point{blob.Details.LastModified});
+return info;
+  }
+
+  static FileInfo DirectoryFileInfoFromPath(const std::string& path) {
+return FileInfo{std::string{internal::RemoveTrailingSlash(path)},
+FileType::Directory};
+  }
+
+  static std::string_view BasenameView(std::string_view s) {
+auto offset = s.find_last_of(internal::kSep);
+auto tail = (offset == std::string_view::npos) ? s : s.substr(offset);
+return internal::RemoveTrailingSlash(tail, /*preserve_root=*/false);
+  }
+
+  /// \brief List the blobs at the root of a container or some dir in a container.
+  ///
+  /// \pre container_client is the client for the container named like the first
+  /// segment of select.base_dir.
+  Status GetFileInfoWithSelectorFromContainer(
+  const Azure::Storage::Blobs::BlobContainerClient& container_client,
+  const Azure::Core::Context& context, Azure::Nullable<int32_t> page_size_hint,
+  const FileSelector& select, FileInfoVector* acc_results) {
+ARROW_ASSIGN_OR_RAISE(auto base_location, AzureLocation::FromString(select.base_dir));
+
+bool found = false;
+Azure::Storage::Blobs::ListBlobsOptions options;
+if (internal::GetAbstractPathDepth(base_location.path) == 0) {
+  // If the base_dir is the root of the container, then we want to list all blobs in
+  // the container and the Prefix should be empty and not even include the trailing
+  // slash because the container itself represents the `/` directory.
+  options.Prefix = {};
+  found = true;  // Unless the container itself is not found later!
+} else {
+  options.Prefix = internal::EnsureTrailingSlash(base_location.path);
+}
+options.PageSizeHint = page_size_hint;
+options.Include = Azure::Storage::Blobs::Models::ListBlobsIncludeFlags::Metadata;
+
+// When Prefix.Value() contains a trailing slash and we find a blob that
+// matches it completely, it is an empty directory marker blob for the
+// directory we're listing from, and we should skip it.
+auto is_empty_dir_marker =
+[&options](const Azure::Storage::Blobs::Models::BlobItem& blob) noexcept -> bool {
+  return options.Prefix.HasValue() && blob.Name == options.Prefix.Value();
+};
+
+auto recurse = [&](const std::string& blob_prefix) noexcept -> Status {
+  if (select.recursive && select.max_recursion > 0) {
+FileSelector sub_select;
+sub_select.base_dir = base_location.container;
+sub_select.base_dir += internal::kSep;
+sub_select.base_dir += internal::RemoveTrailingSlash(blob_prefix);
+sub_select.allow_not_found = true;
+sub_select.recursive = true;
+sub_select.max_recursion = select.max_recursion - 1;
+return GetFileInfoWithSelectorFromContainer(
+container_client, context, page_size_hint, sub_select, acc_results);
+  }
+  return Status::OK();
+};
+
+// (*acc_results)[*last_dir_reported] is the last FileType::Directory in the results
+// produced through this loop over the response pages.
+std::optional<size_t> last_dir_reported{};
+auto matches_last_dir_reported = [&last_dir_reported,
+  acc_re

Re: [PR] GH-38597: [C++] Implement GetFileInfo(selector) for Azure filesystem [arrow]

2023-12-04 Thread via GitHub


felipecrv commented on code in PR #39009:
URL: https://github.com/apache/arrow/pull/39009#discussion_r1414720550


##
cpp/src/arrow/filesystem/azurefs.cc:
##
@@ -1103,7 +1330,12 @@ Result<FileInfo> AzureFileSystem::GetFileInfo(const std::string& path) {
 }
 
Result<FileInfoVector> AzureFileSystem::GetFileInfo(const FileSelector& select) {
-  return Status::NotImplemented("The Azure FileSystem is not fully implemented");
+  Azure::Core::Context context;
+  Azure::Nullable<int32_t> page_size_hint;  // unspecified

Review Comment:
   I want all the wiring to be in place if we need to specify the 
`page_size_hint` which I'm pretty sure we will need to tweak when this is used 
in practice.






Re: [PR] GH-38597: [C++] Implement GetFileInfo(selector) for Azure filesystem [arrow]

2023-12-04 Thread via GitHub


felipecrv commented on code in PR #39009:
URL: https://github.com/apache/arrow/pull/39009#discussion_r1414718868


##
cpp/src/arrow/filesystem/azurefs.cc:
##
@@ -1103,7 +1330,12 @@ Result<FileInfo> AzureFileSystem::GetFileInfo(const std::string& path) {
 }
 
Result<FileInfoVector> AzureFileSystem::GetFileInfo(const FileSelector& select) {
-  return Status::NotImplemented("The Azure FileSystem is not fully implemented");
+  Azure::Core::Context context;
+  Azure::Nullable<int32_t> page_size_hint;  // unspecified
+  FileInfoVector results;
+  RETURN_NOT_OK(
+  impl_->GetFileInfoWithSelector(context, page_size_hint, select, &results));
+  return std::move(results);

Review Comment:
   This is because I want to move the `results` into the `Result` that is 
being created.
   
   It's equivalent to this:
   
   ```cpp
  Result<FileInfoVector> result{std::move(results)};
 return result;
   ```
   
   I will change the line to `return {std::move(results)};` to clarify what is 
going on.
   






Re: [PR] GH-38597: [C++] Implement GetFileInfo(selector) for Azure filesystem [arrow]

2023-12-04 Thread via GitHub


felipecrv commented on code in PR #39009:
URL: https://github.com/apache/arrow/pull/39009#discussion_r1414717355


##
cpp/src/arrow/filesystem/path_util.cc:
##
@@ -72,15 +72,12 @@ std::string SliceAbstractPath(const std::string& s, int offset, int length, char
 return "";
   }
  std::vector<std::string> components = SplitAbstractPath(s, sep);
-  std::stringstream combined;
  if (offset >= static_cast<int>(components.size())) {
 return "";
   }
-  int end = offset + length;
-  if (end > static_cast<int>(components.size())) {
-end = static_cast<int>(components.size());
-  }
-  for (int i = offset; i < end; i++) {
+  const size_t end = std::min(static_cast<size_t>(offset) + length, components.size());

Review Comment:
   Fixing.






Re: [PR] GH-38705: [C++][FS][Azure] Implement CopyFile() [arrow]

2023-12-04 Thread via GitHub


kou commented on PR #39058:
URL: https://github.com/apache/arrow/pull/39058#issuecomment-1839825528

   > > I think there is a problem with the `CopyFromUri()` API though.
   > 
   > Nevermind, it looks like you've got a working implementation using 
`CopyFromUri()`. Maybe it does auth automatically if it detects that the source 
is in the same storage account. Or possibly the Azure SDK automatically 
switches between [Copy Blob From 
URL](https://learn.microsoft.com/en-us/rest/api/storageservices/copy-blob-from-url?tabs=microsoft-entra-id)
 and [Copy 
Blob](https://learn.microsoft.com/en-us/rest/api/storageservices/copy-blob?tabs=microsoft-entra-id)
   
   I think that "Copy Blob From URL" and "Copy Blob" are the same API 
internally.
   Both of them use the same URI and `PUT`:
   * Copy Blob From URL:  
https://learn.microsoft.com/en-us/rest/api/storageservices/copy-blob-from-url#request
   * Copy Blob: 
https://learn.microsoft.com/en-us/rest/api/storageservices/copy-blob#request
   
   And `BlobClient` provides only `CopyFromUri()`. (It doesn't provide 
`Copy()`.)
   So I used `CopyFromUri()`.





Re: [PR] GH-38597: [C++] Implement GetFileInfo(selector) for Azure filesystem [arrow]

2023-12-04 Thread via GitHub


felipecrv commented on code in PR #39009:
URL: https://github.com/apache/arrow/pull/39009#discussion_r1414716712


##
cpp/src/arrow/filesystem/azurefs_test.cc:
##
@@ -479,6 +522,133 @@ TEST_F(AzureHierarchicalNamespaceFileSystemTest, GetFileInfoObject) {
   RunGetFileInfoObjectTest();
 }
 
+TEST_F(AzuriteFileSystemTest, GetFileInfoSelector) {
+  SetUpSmallFileSystemTree();
+
+  FileSelector select;
+  std::vector<FileInfo> infos;
+
+  // Root dir
+  select.base_dir = "";
+  ASSERT_OK_AND_ASSIGN(infos, fs_->GetFileInfo(select));
+  ASSERT_EQ(infos.size(), 3);
+  ASSERT_EQ(infos, SortedInfos(infos));
+  AssertFileInfo(infos[0], "container", FileType::Directory);
+  AssertFileInfo(infos[1], "empty-container", FileType::Directory);
+  AssertFileInfo(infos[2], container_name_, FileType::Directory);
+
+  // Empty container
+  select.base_dir = "empty-container";
+  ASSERT_OK_AND_ASSIGN(infos, fs_->GetFileInfo(select));
+  ASSERT_EQ(infos.size(), 0);
+  // Nonexistent container
+  select.base_dir = "nonexistent-container";
+  ASSERT_RAISES(IOError, fs_->GetFileInfo(select));
+  select.allow_not_found = true;
+  ASSERT_OK_AND_ASSIGN(infos, fs_->GetFileInfo(select));
+  ASSERT_EQ(infos.size(), 0);
+  select.allow_not_found = false;
+  // Non-empty container
+  select.base_dir = "container";
+  ASSERT_OK_AND_ASSIGN(infos, fs_->GetFileInfo(select));
+  ASSERT_EQ(infos, SortedInfos(infos));
+  ASSERT_EQ(infos.size(), 4);
+  AssertFileInfo(infos[0], "container/emptydir", FileType::Directory);
+  AssertFileInfo(infos[1], "container/otherdir", FileType::Directory);
+  AssertFileInfo(infos[2], "container/somedir", FileType::Directory);
+  AssertFileInfo(infos[3], "container/somefile", FileType::File, 9);
+
+  // Empty "directory"
+  select.base_dir = "container/emptydir";
+  ASSERT_OK_AND_ASSIGN(infos, fs_->GetFileInfo(select));
+  ASSERT_EQ(infos.size(), 0);
+  // Non-empty "directories"
+  select.base_dir = "container/somedir";
+  ASSERT_OK_AND_ASSIGN(infos, fs_->GetFileInfo(select));
+  ASSERT_EQ(infos.size(), 1);
+  AssertFileInfo(infos[0], "container/somedir/subdir", FileType::Directory);
+  select.base_dir = "container/somedir/subdir";
+  ASSERT_OK_AND_ASSIGN(infos, fs_->GetFileInfo(select));
+  ASSERT_EQ(infos.size(), 1);
+  AssertFileInfo(infos[0], "container/somedir/subdir/subfile", FileType::File, 8);
+  // Nonexistent
+  select.base_dir = "container/nonexistent";
+  ASSERT_RAISES(IOError, fs_->GetFileInfo(select));
+  select.allow_not_found = true;
+  ASSERT_OK_AND_ASSIGN(infos, fs_->GetFileInfo(select));
+  ASSERT_EQ(infos.size(), 0);
+  select.allow_not_found = false;
+
+  // Trailing slashes
+  select.base_dir = "empty-container/";
+  ASSERT_OK_AND_ASSIGN(infos, fs_->GetFileInfo(select));
+  ASSERT_EQ(infos.size(), 0);
+  select.base_dir = "nonexistent-container/";
+  ASSERT_RAISES(IOError, fs_->GetFileInfo(select));
+  select.base_dir = "container/";
+  ASSERT_OK_AND_ASSIGN(infos, fs_->GetFileInfo(select));
+  ASSERT_EQ(infos, SortedInfos(infos));
+  ASSERT_EQ(infos.size(), 4);
+}
+
+TEST_F(AzuriteFileSystemTest, GetFileInfoSelectorRecursive) {
+  SetUpSmallFileSystemTree();
+
+  FileSelector select;
+  select.recursive = true;
+
+  std::vector<FileInfo> infos;
+  // Root dir
+  select.base_dir = "";
+  ASSERT_OK_AND_ASSIGN(infos, fs_->GetFileInfo(select));
+  ASSERT_EQ(infos.size(), 14);

Review Comment:
   I know. This very cheap check I added to ensure simple mistakes in the code 
don't immediately lead you to the more confusing error messages produced by the 
code that follows it:
   
   ```cpp
 ASSERT_EQ(infos, SortedInfos(infos));
 AssertInfoAllContainersRecursive(infos);
   ```
   
   I don't think checking the length 3 times costs anything significant.






Re: [I] SQL: Ambiguous reference when aliasing in combination with ORDER BY [arrow-datafusion]

2023-12-04 Thread via GitHub


Asura7969 commented on issue #8391:
URL: 
https://github.com/apache/arrow-datafusion/issues/8391#issuecomment-1839821571

   Do you have any plans to fix this bug? If you don't have time I'll follow up 
@Jesse-Bakker 





Re: [PR] GH-38597: [C++] Implement GetFileInfo(selector) for Azure filesystem [arrow]

2023-12-04 Thread via GitHub


felipecrv commented on code in PR #39009:
URL: https://github.com/apache/arrow/pull/39009#discussion_r1414712577


##
cpp/src/arrow/filesystem/azurefs_test.cc:
##
@@ -479,6 +522,133 @@ TEST_F(AzureHierarchicalNamespaceFileSystemTest, GetFileInfoObject) {
   RunGetFileInfoObjectTest();
 }
 
+TEST_F(AzuriteFileSystemTest, GetFileInfoSelector) {
+  SetUpSmallFileSystemTree();
+
+  FileSelector select;
+  std::vector<FileInfo> infos;
+
+  // Root dir
+  select.base_dir = "";
+  ASSERT_OK_AND_ASSIGN(infos, fs_->GetFileInfo(select));
+  ASSERT_EQ(infos.size(), 3);
+  ASSERT_EQ(infos, SortedInfos(infos));
+  AssertFileInfo(infos[0], "container", FileType::Directory);
+  AssertFileInfo(infos[1], "empty-container", FileType::Directory);
+  AssertFileInfo(infos[2], container_name_, FileType::Directory);
+
+  // Empty container
+  select.base_dir = "empty-container";
+  ASSERT_OK_AND_ASSIGN(infos, fs_->GetFileInfo(select));
+  ASSERT_EQ(infos.size(), 0);
+  // Nonexistent container
+  select.base_dir = "nonexistent-container";
+  ASSERT_RAISES(IOError, fs_->GetFileInfo(select));
+  select.allow_not_found = true;
+  ASSERT_OK_AND_ASSIGN(infos, fs_->GetFileInfo(select));
+  ASSERT_EQ(infos.size(), 0);
+  select.allow_not_found = false;
+  // Non-empty container
+  select.base_dir = "container";
+  ASSERT_OK_AND_ASSIGN(infos, fs_->GetFileInfo(select));
+  ASSERT_EQ(infos, SortedInfos(infos));
+  ASSERT_EQ(infos.size(), 4);
+  AssertFileInfo(infos[0], "container/emptydir", FileType::Directory);
+  AssertFileInfo(infos[1], "container/otherdir", FileType::Directory);
+  AssertFileInfo(infos[2], "container/somedir", FileType::Directory);
+  AssertFileInfo(infos[3], "container/somefile", FileType::File, 9);

Review Comment:
   Single-use constants are very common in tests. I don't see how putting these 
in a constant to be used only once (an unnecessary indirection) can help the 
person debugging these unit tests in case they start failing.
   
   Do you have name suggestions for the constants?






Re: [PR] GH-37884: [Swift] allow reading of unaligned FlatBuffers buffers [arrow]

2023-12-04 Thread via GitHub


abandy commented on PR #38635:
URL: https://github.com/apache/arrow/pull/38635#issuecomment-1839817261

   @kou Please review when you get a chance.





Re: [PR] GH-38597: [C++] Implement GetFileInfo(selector) for Azure filesystem [arrow]

2023-12-04 Thread via GitHub


felipecrv commented on code in PR #39009:
URL: https://github.com/apache/arrow/pull/39009#discussion_r1414708333


##
cpp/src/arrow/filesystem/azurefs_test.cc:
##
@@ -287,6 +287,49 @@ class AzureFileSystemTest : public ::testing::Test {
 
   void RunGetFileInfoObjectWithNestedStructureTest();
   void RunGetFileInfoObjectTest();
+
+  void SetUpSmallFileSystemTree() {
+// Set up test containers
+blob_service_client_->GetBlobContainerClient("empty-container").CreateIfNotExists();
+
+auto container_client = blob_service_client_->GetBlobContainerClient("container");
+container_client.CreateIfNotExists();
+
+auto blob_client = container_client.GetBlockBlobClient("emptydir/");
+blob_client.UploadFrom(reinterpret_cast<const uint8_t*>(""), 0);
+
+blob_client = container_client.GetBlockBlobClient("somedir/subdir/subfile");
+const char* sub_data = "sub data";
+blob_client.UploadFrom(reinterpret_cast<const uint8_t*>(sub_data), strlen(sub_data));
+
+blob_client = container_client.GetBlockBlobClient("somefile");
+const char* some_data = "some data";
+blob_client.UploadFrom(reinterpret_cast<const uint8_t*>(some_data),
+   strlen(some_data));
+
+blob_client = container_client.GetBlockBlobClient("otherdir/1/2/3/otherfile");
+const char* other_data = "other data";
+blob_client.UploadFrom(reinterpret_cast<const uint8_t*>(other_data),
+   strlen(other_data));

Review Comment:
   `fs_` is the abstraction on top of the blob client; if I use the abstraction
to produce test data, bugs could cancel each other out. Even when that is not
the case, unit test failures are much harder to debug if you use the code
being tested to also produce the ground truth test data you use to test the
code.
   
   The s3fs tests use the S3 SDK directly to produce test data and I suppose 
it's for these reasons.






Re: [PR] GH-38597: [C++] Implement GetFileInfo(selector) for Azure filesystem [arrow]

2023-12-04 Thread via GitHub


felipecrv commented on code in PR #39009:
URL: https://github.com/apache/arrow/pull/39009#discussion_r1414706522


##
cpp/src/arrow/filesystem/azurefs_test.cc:
##
@@ -213,7 +213,7 @@ class AzureFileSystemTest : public ::testing::Test {
   suite_skipped_ = true;
   GTEST_SKIP() << options.status().message();
 }
-container_name_ = RandomChars(32);
+container_name_ = "z" + RandomChars(31);

Review Comment:
   The `Preexisting*` paths are added automatically by `SetUp`. I want it to
start with `z` so that all paths sort deterministically when other paths,
which start with letters before `z`, are added.
   
   The right solution here is to have every test populate the filesystem 
explicitly and make `SetUp` simpler, but I don't feel like doing it now when so 
many people are adding and changing the tests in the same file.






Re: [PR] GH-38597: [C++] Implement GetFileInfo(selector) for Azure filesystem [arrow]

2023-12-04 Thread via GitHub


felipecrv commented on code in PR #39009:
URL: https://github.com/apache/arrow/pull/39009#discussion_r1414704929


##
cpp/src/arrow/filesystem/azurefs.cc:
##
@@ -815,6 +815,233 @@ class AzureFileSystem::Impl {
 }
   }
 
+ private:
+  template <typename OnContainer>
+  Status ListContainers(const Azure::Core::Context& context,
+OnContainer&& on_container) const {
+Azure::Storage::Blobs::ListBlobContainersOptions options;
+// Deleted containers are not returned.
+options.Include = Azure::Storage::Blobs::Models::ListBlobContainersIncludeFlags::None;
+try {
+  auto container_list_response =
+  blob_service_client_->ListBlobContainers(options, context);
+  for (; container_list_response.HasPage();
+   container_list_response.MoveToNextPage(context)) {
+for (const auto& container : container_list_response.BlobContainers) {
+  RETURN_NOT_OK(on_container(container));
+}
+  }
+} catch (const Azure::Storage::StorageException& exception) {
+  return internal::ExceptionToStatus("Failed to list account containers.", exception);
+}
+return Status::OK();
+  }
+
+  static FileInfo FileInfoFromBlob(const std::string& container,
+   const Azure::Storage::Blobs::Models::BlobItem& blob) {
+if (blob.Name.back() == internal::kSep) {
+  return DirectoryFileInfoFromPath(container + internal::kSep + blob.Name);
+}
+std::string path;
+path.reserve(container.size() + 1 + blob.Name.size());
+path += container;
+path += internal::kSep;
+path += blob.Name;
+FileInfo info{std::move(path), FileType::File};
+info.set_size(blob.BlobSize);
+info.set_mtime(std::chrono::system_clock::time_point{blob.Details.LastModified});
+return info;
+  }
+
+  static FileInfo DirectoryFileInfoFromPath(const std::string& path) {
+return FileInfo{std::string{internal::RemoveTrailingSlash(path)},
+FileType::Directory};
+  }
+
+  static std::string_view BasenameView(std::string_view s) {
+auto offset = s.find_last_of(internal::kSep);
+auto tail = (offset == std::string_view::npos) ? s : s.substr(offset);
+return internal::RemoveTrailingSlash(tail, /*preserve_root=*/false);
+  }
+
+  /// \brief List the blobs at the root of a container or some dir in a container.
+  ///
+  /// \pre container_client is the client for the container named like the first
+  /// segment of select.base_dir.
+  Status GetFileInfoWithSelectorFromContainer(
+  const Azure::Storage::Blobs::BlobContainerClient& container_client,
+  const Azure::Core::Context& context, Azure::Nullable<int32_t> page_size_hint,
+  const FileSelector& select, FileInfoVector* acc_results) {
+ARROW_ASSIGN_OR_RAISE(auto base_location, AzureLocation::FromString(select.base_dir));
+
+bool found = false;
+Azure::Storage::Blobs::ListBlobsOptions options;
+if (internal::GetAbstractPathDepth(base_location.path) == 0) {
+  // If the base_dir is the root of the container, then we want to list all blobs in
+  // the container and the Prefix should be empty and not even include the trailing
+  // slash because the container itself represents the `/` directory.
+  options.Prefix = {};
+  found = true;  // Unless the container itself is not found later!
+} else {
+  options.Prefix = internal::EnsureTrailingSlash(base_location.path);
+}
+options.PageSizeHint = page_size_hint;
+options.Include = Azure::Storage::Blobs::Models::ListBlobsIncludeFlags::Metadata;
+
+// When Prefix.Value() contains a trailing slash and we find a blob that
+// matches it completely, it is an empty directory marker blob for the
+// directory we're listing from, and we should skip it.
+auto is_empty_dir_marker =
+[&options](const Azure::Storage::Blobs::Models::BlobItem& blob) 
noexcept -> bool {
+  return options.Prefix.HasValue() && blob.Name == options.Prefix.Value();
+};
+
+auto recurse = [&](const std::string& blob_prefix) noexcept -> Status {
+  if (select.recursive && select.max_recursion > 0) {
+FileSelector sub_select;
+sub_select.base_dir = base_location.container;
+sub_select.base_dir += internal::kSep;
+sub_select.base_dir += internal::RemoveTrailingSlash(blob_prefix);
+sub_select.allow_not_found = true;
+sub_select.recursive = true;
+sub_select.max_recursion = select.max_recursion - 1;
+return GetFileInfoWithSelectorFromContainer(
+container_client, context, page_size_hint, sub_select, 
acc_results);
+  }
+  return Status::OK();
+};
+
+// (*acc_results)[*last_dir_reported] is the last FileType::Directory in 
the results
+// produced through this loop over the response pages.
+std::optional last_dir_reported{};
+auto matches_last_dir_reported = [&last_dir_reported,
+  acc_re

Re: [PR] GH-15060: [JS] Add LargeUtf8 type [arrow]

2023-12-04 Thread via GitHub


domoritz commented on PR #35780:
URL: https://github.com/apache/arrow/pull/35780#issuecomment-1839808240

   Is it actually an issue that chunks have different lengths? Seems to work 
fine here
   
   ```js
   const b1 = makeBuilder({ type: new Int32 });
   const b2 = makeBuilder({ type: new Int32 });
   
   const d1 = [] as Data[];
   const d2 = [] as Data[];
   
   b1.append(1);
   b1.append(2);
   d1.push(b1.flush());
   b1.append(3);
   d1.push(b1.flush());
   
   b2.append(4);
   d2.push(b2.flush());
   b2.append(5);
   b2.append(6);
   d2.push(b2.flush());
   
   const v1 = new Vector(d1);
   const v2 = new Vector(d2);
   
   const table = new Table({ v1, v2 });
   console.log(table.toArray());
   ```
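For intuition on why mismatched chunk boundaries are workable: each column can resolve a logical row index to a (chunk, offset) pair independently, so rows can be assembled even when the columns' chunks don't line up. A minimal sketch in Python (illustrative only, not the Arrow JS API):

```python
import bisect

def chunk_offsets(chunks):
    # cumulative row offsets: offs[k] is the first logical row of chunk k
    offs = [0]
    for c in chunks:
        offs.append(offs[-1] + len(c))
    return offs

def value_at(chunks, offsets, i):
    # binary-search for the chunk containing logical row i, then index into it
    k = bisect.bisect_right(offsets, i) - 1
    return chunks[k][i - offsets[k]]

# two columns chunked differently but with the same total length,
# mirroring the JS snippet above
v1 = [[1, 2], [3]]
v2 = [[4], [5, 6]]
o1, o2 = chunk_offsets(v1), chunk_offsets(v2)
rows = [(value_at(v1, o1, i), value_at(v2, o2, i)) for i in range(3)]
print(rows)  # [(1, 4), (2, 5), (3, 6)]
```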


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [C++][FS][Azure] Should `GetFileInfo()` against a directory always return true without hierarchical namespace support? [arrow]

2023-12-04 Thread via GitHub


kou commented on issue #38772:
URL: https://github.com/apache/arrow/issues/38772#issuecomment-1839807333

   @felipecrv @Tom-Newton Do either of you want to work on this?





Re: [PR] GH-38705: [C++][FS][Azure] Implement CopyFile() [arrow]

2023-12-04 Thread via GitHub


kou commented on PR #39058:
URL: https://github.com/apache/arrow/pull/39058#issuecomment-1839806315

   Oh, sorry. I didn't read the docstring. I'll remove the behavior.
   
   > Also, at some point the Azure fs implementation will have to implement and 
pass the [generic filesystem 
tests](https://github.com/apache/arrow/blob/f7947cc21bf78d67cf5ac1bf1894b5e04de1a632/cpp/src/arrow/filesystem/test_util.h#L120-L245),
 which do have a test [for this 
situation](https://github.com/apache/arrow/blob/f7947cc21bf78d67cf5ac1bf1894b5e04de1a632/cpp/src/arrow/filesystem/test_util.cc#L555-L558).
   
   I've opened a new issue for it: #39069





Re: [PR] GH-38597: [C++] Implement GetFileInfo(selector) for Azure filesystem [arrow]

2023-12-04 Thread via GitHub


felipecrv commented on code in PR #39009:
URL: https://github.com/apache/arrow/pull/39009#discussion_r1414700135


##
cpp/src/arrow/filesystem/azurefs.cc:
##
@@ -815,6 +815,233 @@ class AzureFileSystem::Impl {
 }
   }
 
+ private:
+  template <typename OnContainer>
+  Status ListContainers(const Azure::Core::Context& context,
+OnContainer&& on_container) const {
+Azure::Storage::Blobs::ListBlobContainersOptions options;
+// Deleted containers are not returned.
+options.Include = 
Azure::Storage::Blobs::Models::ListBlobContainersIncludeFlags::None;
+try {
+  auto container_list_response =
+  blob_service_client_->ListBlobContainers(options, context);
+  for (; container_list_response.HasPage();
+   container_list_response.MoveToNextPage(context)) {
+for (const auto& container : container_list_response.BlobContainers) {
+  RETURN_NOT_OK(on_container(container));
+}
+  }
+} catch (const Azure::Storage::StorageException& exception) {
+  return internal::ExceptionToStatus("Failed to list account containers.", 
exception);
+}
+return Status::OK();
+  }
+
+  static FileInfo FileInfoFromBlob(const std::string& container,
+   const 
Azure::Storage::Blobs::Models::BlobItem& blob) {
+if (blob.Name.back() == internal::kSep) {
+  return DirectoryFileInfoFromPath(container + internal::kSep + blob.Name);
+}
+std::string path;
+path.reserve(container.size() + 1 + blob.Name.size());
+path += container;
+path += internal::kSep;
+path += blob.Name;
+FileInfo info{std::move(path), FileType::File};
+info.set_size(blob.BlobSize);
+
info.set_mtime(std::chrono::system_clock::time_point{blob.Details.LastModified});
+return info;
+  }
+
+  static FileInfo DirectoryFileInfoFromPath(const std::string& path) {
+return FileInfo{std::string{internal::RemoveTrailingSlash(path)},
+FileType::Directory};
+  }
+
+  static std::string_view BasenameView(std::string_view s) {
+auto offset = s.find_last_of(internal::kSep);
+auto tail = (offset == std::string_view::npos) ? s : s.substr(offset);
+return internal::RemoveTrailingSlash(tail, /*preserve_root=*/false);
+  }
+
+  /// \brief List the blobs at the root of a container or some dir in a 
container.
+  ///
+  /// \pre container_client is the client for the container named like the 
first
+  /// segment of select.base_dir.
+  Status GetFileInfoWithSelectorFromContainer(
+  const Azure::Storage::Blobs::BlobContainerClient& container_client,
+  const Azure::Core::Context& context, Azure::Nullable 
page_size_hint,
+  const FileSelector& select, FileInfoVector* acc_results) {
+ARROW_ASSIGN_OR_RAISE(auto base_location, 
AzureLocation::FromString(select.base_dir));
+
+bool found = false;
+Azure::Storage::Blobs::ListBlobsOptions options;
+if (internal::GetAbstractPathDepth(base_location.path) == 0) {
+  // If the base_dir is the root of the container, then we want to list 
all blobs in
+  // the container and the Prefix should be empty and not even include the 
trailing
+  // slash because the container itself represents the `/` 
directory.
+  options.Prefix = {};
+  found = true;  // Unless the container itself is not found later!
+} else {
+  options.Prefix = internal::EnsureTrailingSlash(base_location.path);
+}
+options.PageSizeHint = page_size_hint;
+options.Include = 
Azure::Storage::Blobs::Models::ListBlobsIncludeFlags::Metadata;
+
+// When Prefix.Value() contains a trailing slash and we find a blob that
+// matches it completely, it is an empty directory marker blob for the
+// directory we're listing from, and we should skip it.
+auto is_empty_dir_marker =
+[&options](const Azure::Storage::Blobs::Models::BlobItem& blob) 
noexcept -> bool {
+  return options.Prefix.HasValue() && blob.Name == options.Prefix.Value();
+};
+
+auto recurse = [&](const std::string& blob_prefix) noexcept -> Status {
+  if (select.recursive && select.max_recursion > 0) {
+FileSelector sub_select;
+sub_select.base_dir = base_location.container;
+sub_select.base_dir += internal::kSep;
+sub_select.base_dir += internal::RemoveTrailingSlash(blob_prefix);
+sub_select.allow_not_found = true;
+sub_select.recursive = true;
+sub_select.max_recursion = select.max_recursion - 1;
+return GetFileInfoWithSelectorFromContainer(
+container_client, context, page_size_hint, sub_select, 
acc_results);
+  }
+  return Status::OK();
+};
+
+// (*acc_results)[*last_dir_reported] is the last FileType::Directory in 
the results
+// produced through this loop over the response pages.
+std::optional last_dir_reported{};
+auto matches_last_dir_reported = [&last_dir_reported,
+  acc_re

Re: [PR] GH-38597: [C++] Implement GetFileInfo(selector) for Azure filesystem [arrow]

2023-12-04 Thread via GitHub


felipecrv commented on code in PR #39009:
URL: https://github.com/apache/arrow/pull/39009#discussion_r1414699614


##
cpp/src/arrow/filesystem/azurefs.cc:
##
@@ -815,6 +815,233 @@ class AzureFileSystem::Impl {
 }
   }
 
+ private:
+  template <typename OnContainer>
+  Status ListContainers(const Azure::Core::Context& context,
+OnContainer&& on_container) const {
+Azure::Storage::Blobs::ListBlobContainersOptions options;
+// Deleted containers are not returned.
+options.Include = 
Azure::Storage::Blobs::Models::ListBlobContainersIncludeFlags::None;
+try {
+  auto container_list_response =
+  blob_service_client_->ListBlobContainers(options, context);
+  for (; container_list_response.HasPage();
+   container_list_response.MoveToNextPage(context)) {
+for (const auto& container : container_list_response.BlobContainers) {
+  RETURN_NOT_OK(on_container(container));
+}
+  }
+} catch (const Azure::Storage::StorageException& exception) {
+  return internal::ExceptionToStatus("Failed to list account containers.", 
exception);
+}
+return Status::OK();
+  }
+
+  static FileInfo FileInfoFromBlob(const std::string& container,
+   const 
Azure::Storage::Blobs::Models::BlobItem& blob) {
+if (blob.Name.back() == internal::kSep) {
+  return DirectoryFileInfoFromPath(container + internal::kSep + blob.Name);
+}
+std::string path;
+path.reserve(container.size() + 1 + blob.Name.size());
+path += container;
+path += internal::kSep;
+path += blob.Name;
+FileInfo info{std::move(path), FileType::File};
+info.set_size(blob.BlobSize);
+
info.set_mtime(std::chrono::system_clock::time_point{blob.Details.LastModified});
+return info;
+  }
+
+  static FileInfo DirectoryFileInfoFromPath(const std::string& path) {
+return FileInfo{std::string{internal::RemoveTrailingSlash(path)},
+FileType::Directory};
+  }
+
+  static std::string_view BasenameView(std::string_view s) {
+auto offset = s.find_last_of(internal::kSep);
+auto tail = (offset == std::string_view::npos) ? s : s.substr(offset);
+return internal::RemoveTrailingSlash(tail, /*preserve_root=*/false);
+  }
+
+  /// \brief List the blobs at the root of a container or some dir in a 
container.
+  ///
+  /// \pre container_client is the client for the container named like the 
first
+  /// segment of select.base_dir.
+  Status GetFileInfoWithSelectorFromContainer(
+  const Azure::Storage::Blobs::BlobContainerClient& container_client,
+  const Azure::Core::Context& context, Azure::Nullable 
page_size_hint,
+  const FileSelector& select, FileInfoVector* acc_results) {
+ARROW_ASSIGN_OR_RAISE(auto base_location, 
AzureLocation::FromString(select.base_dir));
+
+bool found = false;
+Azure::Storage::Blobs::ListBlobsOptions options;
+if (internal::GetAbstractPathDepth(base_location.path) == 0) {
+  // If the base_dir is the root of the container, then we want to list 
all blobs in
+  // the container and the Prefix should be empty and not even include the 
trailing
+  // slash because the container itself represents the `/` 
directory.
+  options.Prefix = {};
+  found = true;  // Unless the container itself is not found later!
+} else {
+  options.Prefix = internal::EnsureTrailingSlash(base_location.path);
+}
+options.PageSizeHint = page_size_hint;
+options.Include = 
Azure::Storage::Blobs::Models::ListBlobsIncludeFlags::Metadata;
+
+// When Prefix.Value() contains a trailing slash and we find a blob that
+// matches it completely, it is an empty directory marker blob for the
+// directory we're listing from, and we should skip it.
+auto is_empty_dir_marker =
+[&options](const Azure::Storage::Blobs::Models::BlobItem& blob) 
noexcept -> bool {
+  return options.Prefix.HasValue() && blob.Name == options.Prefix.Value();
+};
+
+auto recurse = [&](const std::string& blob_prefix) noexcept -> Status {
+  if (select.recursive && select.max_recursion > 0) {
+FileSelector sub_select;
+sub_select.base_dir = base_location.container;
+sub_select.base_dir += internal::kSep;
+sub_select.base_dir += internal::RemoveTrailingSlash(blob_prefix);
+sub_select.allow_not_found = true;
+sub_select.recursive = true;
+sub_select.max_recursion = select.max_recursion - 1;
+return GetFileInfoWithSelectorFromContainer(
+container_client, context, page_size_hint, sub_select, 
acc_results);
+  }
+  return Status::OK();
+};
+
+// (*acc_results)[*last_dir_reported] is the last FileType::Directory in 
the results
+// produced through this loop over the response pages.
+std::optional last_dir_reported{};
+auto matches_last_dir_reported = [&last_dir_reported,
+  acc_re

Re: [PR] GH-38597: [C++] Implement GetFileInfo(selector) for Azure filesystem [arrow]

2023-12-04 Thread via GitHub


felipecrv commented on code in PR #39009:
URL: https://github.com/apache/arrow/pull/39009#discussion_r1414698838


##
cpp/src/arrow/filesystem/azurefs.cc:
##
@@ -815,6 +815,233 @@ class AzureFileSystem::Impl {
 }
   }
 
+ private:
+  template <typename OnContainer>
+  Status ListContainers(const Azure::Core::Context& context,
+OnContainer&& on_container) const {
+Azure::Storage::Blobs::ListBlobContainersOptions options;
+// Deleted containers are not returned.
+options.Include = 
Azure::Storage::Blobs::Models::ListBlobContainersIncludeFlags::None;
+try {
+  auto container_list_response =
+  blob_service_client_->ListBlobContainers(options, context);
+  for (; container_list_response.HasPage();
+   container_list_response.MoveToNextPage(context)) {
+for (const auto& container : container_list_response.BlobContainers) {
+  RETURN_NOT_OK(on_container(container));
+}
+  }
+} catch (const Azure::Storage::StorageException& exception) {
+  return internal::ExceptionToStatus("Failed to list account containers.", 
exception);
+}
+return Status::OK();
+  }
+
+  static FileInfo FileInfoFromBlob(const std::string& container,
+   const 
Azure::Storage::Blobs::Models::BlobItem& blob) {
+if (blob.Name.back() == internal::kSep) {
+  return DirectoryFileInfoFromPath(container + internal::kSep + blob.Name);
+}
+std::string path;
+path.reserve(container.size() + 1 + blob.Name.size());
+path += container;
+path += internal::kSep;
+path += blob.Name;
+FileInfo info{std::move(path), FileType::File};
+info.set_size(blob.BlobSize);
+
info.set_mtime(std::chrono::system_clock::time_point{blob.Details.LastModified});
+return info;
+  }
+
+  static FileInfo DirectoryFileInfoFromPath(const std::string& path) {
+return FileInfo{std::string{internal::RemoveTrailingSlash(path)},
+FileType::Directory};
+  }
+
+  static std::string_view BasenameView(std::string_view s) {
+auto offset = s.find_last_of(internal::kSep);
+auto tail = (offset == std::string_view::npos) ? s : s.substr(offset);
+return internal::RemoveTrailingSlash(tail, /*preserve_root=*/false);
+  }
+
+  /// \brief List the blobs at the root of a container or some dir in a 
container.
+  ///
+  /// \pre container_client is the client for the container named like the 
first
+  /// segment of select.base_dir.
+  Status GetFileInfoWithSelectorFromContainer(
+  const Azure::Storage::Blobs::BlobContainerClient& container_client,
+  const Azure::Core::Context& context, Azure::Nullable 
page_size_hint,
+  const FileSelector& select, FileInfoVector* acc_results) {
+ARROW_ASSIGN_OR_RAISE(auto base_location, 
AzureLocation::FromString(select.base_dir));
+
+bool found = false;
+Azure::Storage::Blobs::ListBlobsOptions options;
+if (internal::GetAbstractPathDepth(base_location.path) == 0) {
+  // If the base_dir is the root of the container, then we want to list 
all blobs in
+  // the container and the Prefix should be empty and not even include the 
trailing
+  // slash because the container itself represents the `/` 
directory.
+  options.Prefix = {};

Review Comment:
   The compiler can elide it, so its reason for existing is to anchor the comment.






Re: [PR] GH-38597: [C++] Implement GetFileInfo(selector) for Azure filesystem [arrow]

2023-12-04 Thread via GitHub


felipecrv commented on code in PR #39009:
URL: https://github.com/apache/arrow/pull/39009#discussion_r1414697321


##
cpp/src/arrow/filesystem/azurefs.cc:
##
@@ -815,6 +815,233 @@ class AzureFileSystem::Impl {
 }
   }
 
+ private:
+  template <typename OnContainer>
+  Status ListContainers(const Azure::Core::Context& context,
+OnContainer&& on_container) const {
+Azure::Storage::Blobs::ListBlobContainersOptions options;
+// Deleted containers are not returned.
+options.Include = 
Azure::Storage::Blobs::Models::ListBlobContainersIncludeFlags::None;
+try {
+  auto container_list_response =
+  blob_service_client_->ListBlobContainers(options, context);
+  for (; container_list_response.HasPage();
+   container_list_response.MoveToNextPage(context)) {
+for (const auto& container : container_list_response.BlobContainers) {
+  RETURN_NOT_OK(on_container(container));
+}
+  }
+} catch (const Azure::Storage::StorageException& exception) {
+  return internal::ExceptionToStatus("Failed to list account containers.", 
exception);
+}
+return Status::OK();
+  }
+
+  static FileInfo FileInfoFromBlob(const std::string& container,
+   const 
Azure::Storage::Blobs::Models::BlobItem& blob) {
+if (blob.Name.back() == internal::kSep) {
+  return DirectoryFileInfoFromPath(container + internal::kSep + blob.Name);
+}
+std::string path;
+path.reserve(container.size() + 1 + blob.Name.size());
+path += container;
+path += internal::kSep;
+path += blob.Name;
+FileInfo info{std::move(path), FileType::File};
+info.set_size(blob.BlobSize);
+
info.set_mtime(std::chrono::system_clock::time_point{blob.Details.LastModified});
+return info;
+  }
+
+  static FileInfo DirectoryFileInfoFromPath(const std::string& path) {
+return FileInfo{std::string{internal::RemoveTrailingSlash(path)},
+FileType::Directory};
+  }
+
+  static std::string_view BasenameView(std::string_view s) {
+auto offset = s.find_last_of(internal::kSep);
+auto tail = (offset == std::string_view::npos) ? s : s.substr(offset);
+return internal::RemoveTrailingSlash(tail, /*preserve_root=*/false);
+  }

Review Comment:
   I'm changing it to this:
   
   ```cpp
 static std::string_view BasenameView(std::string_view s) {
   DCHECK(!internal::HasTrailingSlash(s));
   auto offset = s.find_last_of(internal::kSep);
   auto result = (offset == std::string_view::npos) ? s : s.substr(offset);
   DCHECK(!result.empty() && result.back() != internal::kSep);
   return result;
 }
   ```
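The same contract can be sketched outside C++. Note that the helper deliberately keeps the separator at the front of the result when one exists, and the asserts stand in for the `DCHECK`ed preconditions (a hypothetical Python rendering, not Arrow code):

```python
SEP = "/"

def basename_view(s):
    # mirrors the reviewed helper: s must already be normalized,
    # i.e. non-empty with no trailing separator (the DCHECKed precondition)
    assert s and not s.endswith(SEP)
    off = s.rfind(SEP)
    result = s if off == -1 else s[off:]  # substr(offset) keeps the leading "/"
    assert result and not result.endswith(SEP)
    return result

print(basename_view("container/dir/blob"))  # /blob
print(basename_view("blob"))                # blob
```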






Re: [PR] GH-38597: [C++] Implement GetFileInfo(selector) for Azure filesystem [arrow]

2023-12-04 Thread via GitHub


felipecrv commented on code in PR #39009:
URL: https://github.com/apache/arrow/pull/39009#discussion_r1414680091


##
cpp/src/arrow/filesystem/azurefs.cc:
##
@@ -815,6 +815,233 @@ class AzureFileSystem::Impl {
 }
   }
 
+ private:
+  template <typename OnContainer>
+  Status ListContainers(const Azure::Core::Context& context,
+OnContainer&& on_container) const {
+Azure::Storage::Blobs::ListBlobContainersOptions options;
+// Deleted containers are not returned.
+options.Include = 
Azure::Storage::Blobs::Models::ListBlobContainersIncludeFlags::None;
+try {
+  auto container_list_response =
+  blob_service_client_->ListBlobContainers(options, context);
+  for (; container_list_response.HasPage();
+   container_list_response.MoveToNextPage(context)) {
+for (const auto& container : container_list_response.BlobContainers) {
+  RETURN_NOT_OK(on_container(container));
+}
+  }
+} catch (const Azure::Storage::StorageException& exception) {
+  return internal::ExceptionToStatus("Failed to list account containers.", 
exception);
+}
+return Status::OK();
+  }
+
+  static FileInfo FileInfoFromBlob(const std::string& container,
+   const 
Azure::Storage::Blobs::Models::BlobItem& blob) {
+if (blob.Name.back() == internal::kSep) {
+  return DirectoryFileInfoFromPath(container + internal::kSep + blob.Name);
+}
+std::string path;
+path.reserve(container.size() + 1 + blob.Name.size());
+path += container;
+path += internal::kSep;
+path += blob.Name;
+FileInfo info{std::move(path), FileType::File};
+info.set_size(blob.BlobSize);
+
info.set_mtime(std::chrono::system_clock::time_point{blob.Details.LastModified});
+return info;
+  }
+
+  static FileInfo DirectoryFileInfoFromPath(const std::string& path) {
+return FileInfo{std::string{internal::RemoveTrailingSlash(path)},
+FileType::Directory};
+  }
+
+  static std::string_view BasenameView(std::string_view s) {
+auto offset = s.find_last_of(internal::kSep);
+auto tail = (offset == std::string_view::npos) ? s : s.substr(offset);
+return internal::RemoveTrailingSlash(tail, /*preserve_root=*/false);
+  }

Review Comment:
   The problem with the `path_util.h` functions is that they don't follow any 
concept of path normalization for inputs and have to check the input for many 
variations. This `BasenameView` works well in this context, but not in a 
general one.






Re: [PR] GH-38597: [C++] Implement GetFileInfo(selector) for Azure filesystem [arrow]

2023-12-04 Thread via GitHub


felipecrv commented on code in PR #39009:
URL: https://github.com/apache/arrow/pull/39009#discussion_r1414675638


##
cpp/src/arrow/filesystem/azurefs.cc:
##
@@ -815,6 +815,233 @@ class AzureFileSystem::Impl {
 }
   }
 
+ private:
+  template <typename OnContainer>
+  Status ListContainers(const Azure::Core::Context& context,
+OnContainer&& on_container) const {
+Azure::Storage::Blobs::ListBlobContainersOptions options;
+// Deleted containers are not returned.
+options.Include = 
Azure::Storage::Blobs::Models::ListBlobContainersIncludeFlags::None;
+try {
+  auto container_list_response =
+  blob_service_client_->ListBlobContainers(options, context);
+  for (; container_list_response.HasPage();
+   container_list_response.MoveToNextPage(context)) {
+for (const auto& container : container_list_response.BlobContainers) {
+  RETURN_NOT_OK(on_container(container));
+}
+  }
+} catch (const Azure::Storage::StorageException& exception) {
+  return internal::ExceptionToStatus("Failed to list account containers.", 
exception);
+}
+return Status::OK();
+  }
+
+  static FileInfo FileInfoFromBlob(const std::string& container,
+   const 
Azure::Storage::Blobs::Models::BlobItem& blob) {
+if (blob.Name.back() == internal::kSep) {
+  return DirectoryFileInfoFromPath(container + internal::kSep + blob.Name);
+}
+std::string path;
+path.reserve(container.size() + 1 + blob.Name.size());
+path += container;
+path += internal::kSep;
+path += blob.Name;

Review Comment:
   My goal here was to avoid relying on the semantics of the path_util 
utilities, which can be very confusing with all the checks and indirections, 
but I will use `ConcatAbstractPath` now.
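For reference, the kind of join being discussed can be sketched as follows: a single `/` separator, with no duplicated or trailing slashes (a hypothetical helper written for illustration, not the actual `ConcatAbstractPath` implementation):

```python
def concat_abstract_path(base, stem):
    # hypothetical join in the spirit of the discussion, not Arrow's helper:
    # one "/" separator, no duplicated or trailing slashes
    base = base.rstrip("/")
    stem = stem.strip("/")
    return stem if not base else base + "/" + stem

print(concat_abstract_path("container", "blob.txt"))  # container/blob.txt
print(concat_abstract_path("container/", "/dir/b/"))  # container/dir/b
```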






[PR] fix: Literal in window definition should not be an ordinal referring to relation column [arrow-datafusion]

2023-12-04 Thread via GitHub


viirya opened a new pull request, #8419:
URL: https://github.com/apache/arrow-datafusion/pull/8419

   ## Which issue does this PR close?
   
   
   
   Closes #.
   
   ## Rationale for this change
   
   
   
   This continues to fix a bug found in #8410. In particular, a literal value 
in the `ORDER BY` clause of a window definition is currently treated as an 
ordinal reference to a column in the relation, which I think matches the 
behavior of sort expressions under the top-level `ORDER BY` keyword. For a 
window definition, however, Postgres simply treats a literal value as a 
literal:
   
   ```
   postgres=# select a, rank() over (order by 1 RANGE between UNBOUNDED 
PRECEDING AND UNBOUNDED FOLLOWING) rnk from (select 1 a union select 2 a) q;
a | rnk 
   ---+-
1 |   1
2 |   1
   (2 rows)
   
   postgres=# select a, rank() over (order by 1 RANGE between 1 PRECEDING AND 1 
FOLLOWING) rnk from (select 1 a union select 2 a) q;
a | rnk 
   ---+-
1 |   1
2 |   1
   (2 rows)
   
   ```
   
   ## What changes are included in this PR?
   
   
   
   Treat a literal value in the `ORDER BY` clause of a window definition as a 
literal, instead of as a column ordinal.
   
   ## Are these changes tested?
   
   
   
   Modified test in `sqllogictest`.
   
   ## Are there any user-facing changes?
   
   
   
   
   
   A literal in the `ORDER BY` clause of a window definition is now treated as 
a literal instead of as a column ordinal.
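The behavioral difference can be sketched with a toy resolver (hypothetical names, not DataFusion's actual planner API):

```python
def resolve_order_by(expr, columns, in_window_def):
    # hypothetical resolver: an integer literal in a top-level ORDER BY is a
    # 1-based column ordinal, but inside OVER (ORDER BY ...) it stays a
    # plain constant, matching the Postgres behavior shown above
    if isinstance(expr, int) and not in_window_def:
        return ("column", columns[expr - 1])
    return ("literal", expr)

cols = ["a", "rnk"]
print(resolve_order_by(1, cols, in_window_def=False))  # ('column', 'a')
print(resolve_order_by(1, cols, in_window_def=True))   # ('literal', 1)
```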
   





Re: [PR] GH-38597: [C++] Implement GetFileInfo(selector) for Azure filesystem [arrow]

2023-12-04 Thread via GitHub


felipecrv commented on code in PR #39009:
URL: https://github.com/apache/arrow/pull/39009#discussion_r1414655422


##
cpp/src/arrow/filesystem/azurefs.cc:
##
@@ -815,6 +815,233 @@ class AzureFileSystem::Impl {
 }
   }
 
+ private:
+  template <typename OnContainer>
+  Status ListContainers(const Azure::Core::Context& context,
+OnContainer&& on_container) const {
+Azure::Storage::Blobs::ListBlobContainersOptions options;
+// Deleted containers are not returned.
+options.Include = 
Azure::Storage::Blobs::Models::ListBlobContainersIncludeFlags::None;

Review Comment:
   I wanted to leave a placeholder for extension later (if needed), but I will 
delete these two lines.






Re: [PR] MINOR: [CI] Bump actions/setup-dotnet from 3 to 4 [arrow]

2023-12-04 Thread via GitHub


conbench-apache-arrow[bot] commented on PR #39061:
URL: https://github.com/apache/arrow/pull/39061#issuecomment-1839737398

   After merging your PR, Conbench analyzed the 4 benchmarking runs that have 
been run so far on merge-commit 87fbf8990671d254ca77c36c99f24e773c6fdafd.
   
   There were no benchmark performance regressions. 🎉
   
   The [full Conbench report](https://github.com/apache/arrow/runs/19308528838) 
has more details. It also includes information about 1 possible false positive 
for unstable benchmarks that are known to sometimes produce them.





Re: [PR] Update code comment for the cases of regularized RANGE frame and add tests for ORDER BY cases with RANGE frame [arrow-datafusion]

2023-12-04 Thread via GitHub


viirya merged PR #8410:
URL: https://github.com/apache/arrow-datafusion/pull/8410





Re: [PR] Update code comment for the cases of regularized RANGE frame and add tests for ORDER BY cases with RANGE frame [arrow-datafusion]

2023-12-04 Thread via GitHub


viirya commented on PR #8410:
URL: 
https://github.com/apache/arrow-datafusion/pull/8410#issuecomment-1839713617

   Thanks @jackwener 





Re: [PR] Minor: Add installation link to README.md [arrow-datafusion]

2023-12-04 Thread via GitHub


comphead merged PR #8389:
URL: https://github.com/apache/arrow-datafusion/pull/8389





Re: [I] Make it easier to install / get started with DataFusion [arrow-datafusion]

2023-12-04 Thread via GitHub


comphead closed issue #7297: Make it easier to install / get started with 
DataFusion
URL: https://github.com/apache/arrow-datafusion/issues/7297





Re: [PR] enable field-order-agnostic overloads of `fromarrow` for struct types [arrow-julia]

2023-12-04 Thread via GitHub


jrevels commented on PR #493:
URL: https://github.com/apache/arrow-julia/pull/493#issuecomment-1839654384

   Alright, I changed the approach here after profiling different candidate 
approaches. Looks like leaving `NamedTuple` construction (or even just any 
intermediate struct construction) in the critical path here borks `getindex` 
performance, annoyingly, so I ended up not going that route. Users of this 
feature can still construct `NamedTuple`s themselves, if needed, but the new 
approach avoids doing so on the default path. 
   
   on this PR now:
   
   ```jl
   BenchmarkTools.Trial: 8170 samples with 1 evaluation.
Range (min … max):  530.500 μs …   2.318 ms  ┊ GC (min … max): 0.00% … 
74.22%
Time  (median): 547.833 μs   ┊ GC (median):0.00%
Time  (mean ± σ):   610.966 μs ± 263.948 μs  ┊ GC (mean ± σ):  9.75% ± 
14.73%
   
 █▆▃▁  ▁▃▁ ▁
 ▇▇▅▃▅▁▁▃▁▄███ █
 530 μsHistogram: log(frequency) by time   1.75 ms <
   
Memory estimate: 2.22 MiB, allocs estimate: 18387.
```
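The idea behind the field-order-agnostic path can be sketched as follows: map incoming fields onto the target struct's declared field order directly, rather than materializing an intermediate `NamedTuple` on every access (a hypothetical Python rendering, not the Julia implementation):

```python
from collections import namedtuple

Point = namedtuple("Point", ["x", "y"])

def from_arrow_row(row, field_order, constructor):
    # hypothetical: fields may arrive in arbitrary order; pull them out in
    # the constructor's declared order, with no intermediate struct
    return constructor(*(row[f] for f in field_order))

arrow_row = {"y": 2, "x": 1}  # fields serialized in a different order
p = from_arrow_row(arrow_row, Point._fields, Point)
print(p)  # Point(x=1, y=2)
```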





Re: [I] Consolidate statistics aggregation [arrow-datafusion]

2023-12-04 Thread via GitHub


alamb commented on issue #8229:
URL: 
https://github.com/apache/arrow-datafusion/issues/8229#issuecomment-1839645692

   https://github.com/apache/arrow-datafusion/pull/8254 has the basics in 
place, but I need to work on other higher-priority items this week and next, so 
this has been pushed lower on my priority list.





Re: [I] Support AVRO Format for Write Queries [arrow-datafusion]

2023-12-04 Thread via GitHub


alamb commented on issue #7679:
URL: 
https://github.com/apache/arrow-datafusion/issues/7679#issuecomment-1839611958

   > @alamb Hi, I can help with this ticket : )
   
   Thanks @Veeupup -- for this one, the first thing we need to do is get an Avro 
writer, and the second part is hooking it into DataFusion. There is some work 
upstream in arrow-rs to make an Avro reader/writer in 
https://github.com/apache/arrow-rs/issues/4886, but I think that project is 
stalled at the moment.





[PR] Easy way to zero-copy IPC buffers. [arrow-rs]

2023-12-04 Thread via GitHub


comath opened a new pull request, #5165:
URL: https://github.com/apache/arrow-rs/pull/5165

   # Which issue does this PR close?
   
   Closes #5153
   
   # Rationale for this change

   There are ways to memmap buffers, if you know the header information, but no 
easy way to use the existing IPC file format in a zero-copy way. 
   
   # What changes are included in this PR?
   
   Reused the IPC reader code to read the header and footer information from an 
IPC file, but swapped out the read/copy of the buffer for the Buffer creation 
with `Buffer::from_custom_allocation`. Minor savings can be had by not 
allocating the arbitrary-sized footer and message read, but the easy way to do 
this requires a nightly API in the Cursor object. 
   
   This duplicated some code, but it's messy to deduplicate because most of it 
is boilerplate around reading the flatbuffer messages. The better way to 
deduplicate is to change the `FileReader` to optionally accept a buffer and 
conditionally use that in the 2 locations where buffers are actually read. 
Perfectly happy to do it this way as well.
   
   # Are there any user-facing changes?
   
   There's a new struct that mirrors the current IPC reader struct that passes 
back zero-copy buffers. 
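The mechanics of `Buffer::from_custom_allocation` boil down to "wrap a pointer into an existing allocation and keep that allocation's owner alive for the buffer's lifetime". The same zero-copy idea can be sketched with nothing but the Python standard library — the file layout and offsets below are invented for illustration and are not the Arrow IPC format:

```python
import mmap
import tempfile

# Stand-in for an Arrow IPC file: write bytes to disk, memory-map them,
# and hand out views of "buffers" without copying. The memoryview slices
# play the role of the zero-copy Buffers, and the mmap object plays the
# role of the shared owner that must outlive every buffer cut from it.
with tempfile.TemporaryFile() as f:
    f.write(b"HEADERpayload-bytesFOOTER")
    f.flush()
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    view = memoryview(mm)
    # "Reading" a buffer is just slicing the mapping -- no bytes are copied.
    payload = view[6:19]
    recovered = bytes(payload)  # copy only to inspect the contents
    # Views must be released before the mapping is closed -- the same
    # lifetime rule the Arc-based owner enforces on the Rust side.
    payload.release()
    view.release()
    mm.close()

# recovered == b"payload-bytes"
```

The design choice mirrored here is the key one in the PR: instead of copying bytes out of the file into freshly allocated buffers, the reader keeps one shared owner (the mapping) and cuts borrowed views from it.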
   
   
   





Re: [I] [R] nanoarrow as an interchange class? [arrow-nanoarrow]

2023-12-04 Thread via GitHub


paleolimbot commented on issue #331:
URL: 
https://github.com/apache/arrow-nanoarrow/issues/331#issuecomment-1839578943

   I *think* that is a good idea, although I might need a more specific example 
to more concretely comment. Today you might have to use `as_nanoarrow_array()` 
as a backup. It may not be quite the same, but in the `geos` and `s2` and 
`geoarrow` packages I fall back on `wk_handle()`'s generic (which avoids most 
of those packages having to know about any of the others). I'd like to move 
those to start using `as_nanoarrow_array_stream()` in a way like you described.
   
   In general, the pattern I envisioned for interchange is:
   
   - Producers that produce objects that can be interpreted as arrays implement 
an `as_nanoarrow_array()` S3 method (e.g., `arrow::Array` has a method for 
`as_nanoarrow_array()`)
   - Consumers that accept arrays call `as_nanoarrow_array()` to sanitize the 
input.
   
   The `as_nanoarrow_array_stream()` version is slightly more generic (e.g., 
can accommodate streaming results and/or Chunked arrays), but isn't *quite* 
where it should be (e.g., if you call `as_nanoarrow_array_stream()` on an 
`arrow::ChunkedArray` today, I think you would get an error, even though it 
*should* work).





[PR] Minor: Add example with parameters to LogicalPlan [arrow-datafusion]

2023-12-04 Thread via GitHub


alamb opened a new pull request, #8418:
URL: https://github.com/apache/arrow-datafusion/pull/8418

   ## Which issue does this PR close?
   
   Follow on to https://github.com/apache/arrow-datafusion/pull/8384
   
   ## Rationale for this change
   
   Add an example in the docs so it is slightly easier to figure out how to use 
named query parameters
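For context on what such an example demonstrates: a named query parameter is a `$name` placeholder in the SQL text that is bound to a value at execution time. The sketch below is a generic illustration of that placeholder shape — the function name and the string-substitution approach are mine, not DataFusion's API, which binds values against the logical plan rather than rewriting the SQL text:

```python
import re

def bind_named_params(sql: str, params: dict) -> str:
    """Illustrative only: splice `$name` placeholders into the SQL text.

    A real engine (DataFusion included) binds parameter values against
    the plan instead of rewriting the query string, which sidesteps
    quoting and injection problems; this sketch only shows the shape
    of a named placeholder.
    """
    def repl(match):
        value = params[match.group(1)]
        return repr(value) if isinstance(value, str) else str(value)
    return re.sub(r"\$(\w+)", repl, sql)

query = "SELECT * FROM t WHERE id = $my_id AND name = $my_name"
bound = bind_named_params(query, {"my_id": 7, "my_name": "foo"})
# bound == "SELECT * FROM t WHERE id = 7 AND name = 'foo'"
```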
   
   ## What changes are included in this PR?
   
   Extra doc example
   
   ## Are these changes tested?
   Yes, doc test
   
   
   ## Are there any user-facing changes?
   
   
   
   
   





[PR] chore: some debug info [arrow-datafusion]

2023-12-04 Thread via GitHub


NGA-TRAN opened a new pull request, #8417:
URL: https://github.com/apache/arrow-datafusion/pull/8417

   ## Which issue does this PR close?
   
   
   
   Closes #.
   
   ## Rationale for this change
   
   
   
   ## What changes are included in this PR?
   
   
   
   ## Are these changes tested?
   
   
   
   ## Are there any user-facing changes?
   
   
   
   
   





Re: [PR] Add Projection to FilterExec [arrow-datafusion]

2023-12-04 Thread via GitHub


Dandandan closed pull request #8416: Add Projection to FilterExec
URL: https://github.com/apache/arrow-datafusion/pull/8416





Re: [PR] Add Projection to FilterExec [arrow-datafusion]

2023-12-04 Thread via GitHub


Dandandan commented on PR #8416:
URL: 
https://github.com/apache/arrow-datafusion/pull/8416#issuecomment-1839535102

   This approach is not going to work at this time





Re: [PR] GH-15058: [C++][Python] Native support for UUID [arrow]

2023-12-04 Thread via GitHub


rok commented on PR #37298:
URL: https://github.com/apache/arrow/pull/37298#issuecomment-1839526314

   @arogozhnikov I think it was mostly about lack of reviews :)
   I've rebased, let's see what current problems are.





Re: [PR] Support named query parameters [arrow-datafusion]

2023-12-04 Thread via GitHub


alamb merged PR #8384:
URL: https://github.com/apache/arrow-datafusion/pull/8384





Re: [I] Support named query parameters [arrow-datafusion]

2023-12-04 Thread via GitHub


alamb closed issue #8245: Support named query parameters
URL: https://github.com/apache/arrow-datafusion/issues/8245





Re: [PR] Support named query parameters [arrow-datafusion]

2023-12-04 Thread via GitHub


alamb commented on PR #8384:
URL: 
https://github.com/apache/arrow-datafusion/pull/8384#issuecomment-1839525992

   Thanks @Asura7969 !





Re: [PR] Make filter selectivity for statistics configurable [arrow-datafusion]

2023-12-04 Thread via GitHub


alamb commented on code in PR #8243:
URL: https://github.com/apache/arrow-datafusion/pull/8243#discussion_r1414535729


##
datafusion/physical-plan/src/filter.rs:
##
@@ -994,4 +1014,22 @@ mod tests {
 
 Ok(())
 }
+
+#[tokio::test]
+async fn test_validation_filter_selectivity() -> Result<()> {

Review Comment:
   > Not sure however how 
[this](https://github.com/apache/arrow-datafusion/blob/main/datafusion/physical-plan/src/filter.rs#L221)
 should be handled
   
   That code will be invoked for 'complicated' predicates -- maybe we could 
fake it with something like `sin(x) = 4.0`.



##
datafusion/physical-plan/src/filter.rs:
##
@@ -994,4 +1014,54 @@ mod tests {
 
 Ok(())
 }
+
+#[tokio::test]
+async fn test_validation_filter_selectivity() -> Result<()> {
+let schema = Schema::new(vec![Field::new("a", DataType::Int32, 
false)]);
+let input = Arc::new(StatisticsExec::new(
+Statistics::new_unknown(&schema),
+schema,
+));
+// WHERE a = 10
+let predicate = Arc::new(BinaryExpr::new(
+Arc::new(Column::new("a", 0)),
+Operator::Eq,
+Arc::new(Literal::new(ScalarValue::Int32(Some(10,
+));
+let filter = FilterExec::try_new(predicate, input)?;
+assert!(filter.with_default_selectivity(120).is_err());
+Ok(())
+}
+
+#[tokio::test]
+async fn test_custom_filter_selectivity() -> Result<()> {
+// Need a decimal to trigger inexact selectivity
+let schema =
+Schema::new(vec![Field::new("a", DataType::Decimal128(2, 3), 
false)]);
+let input = Arc::new(StatisticsExec::new(
+Statistics {
+num_rows: Precision::Inexact(1000),
+total_byte_size: Precision::Inexact(4000),
+column_statistics: vec![ColumnStatistics {
+..Default::default()
+}],
+},
+schema,
+));
+// WHERE a = 10
+let predicate = Arc::new(BinaryExpr::new(
+Arc::new(Column::new("a", 0)),
+Operator::Eq,
+Arc::new(Literal::new(ScalarValue::Decimal128(Some(10), 10, 10))),
+));
+let filter = FilterExec::try_new(predicate, input)?;
+let statistics = filter.statistics()?;
+assert_eq!(statistics.num_rows, Precision::Inexact(200));

Review Comment:
   👌 very nice
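For readers following along, the arithmetic these two tests pin down is simple enough to sketch outside Rust — the function name below is mine, but the behavior mirrors the assertions: a default selectivity is a percentage that must lie in [0, 100], and applying it scales the inexact row estimate:

```python
def estimate_filtered_rows(num_rows: int, selectivity_pct: int) -> int:
    """Apply a default selectivity (a percentage in [0, 100]) to an
    inexact row-count estimate, as the tests above exercise."""
    if not 0 <= selectivity_pct <= 100:
        raise ValueError(
            f"selectivity must be in [0, 100], got {selectivity_pct}")
    return num_rows * selectivity_pct // 100

# Mirrors the two assertions: 120 is rejected, and the default 20%
# selectivity turns an Inexact(1000) row count into Inexact(200).
assert estimate_filtered_rows(1000, 20) == 200
try:
    estimate_filtered_rows(1000, 120)
    raised = False
except ValueError:
    raised = True
# raised == True
```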









Re: [PR] MINOR: [CI] Bump actions/setup-dotnet from 3 to 4 [arrow]

2023-12-04 Thread via GitHub


kou merged PR #39061:
URL: https://github.com/apache/arrow/pull/39061





Re: [PR] GH-38007: [C++][Python] Add VariableShapeTensor implementation [arrow]

2023-12-04 Thread via GitHub


jorisvandenbossche commented on code in PR #38008:
URL: https://github.com/apache/arrow/pull/38008#discussion_r1414528595


##
python/pyarrow/array.pxi:
##
@@ -3586,6 +3586,156 @@ class FixedShapeTensorArray(ExtensionArray):
 )
 
 
+cdef class VariableShapeTensorArray(ExtensionArray):
+"""
+Concrete class for variable shape tensor extension arrays.
+
+Examples
+
+Define the extension type for tensor array
+
+>>> import pyarrow as pa
+>>> tensor_type = pa.variable_shape_tensor(pa.int32(), 2)
+
+Create an extension array
+
+>>> shapes = pa.array([[2, 3], [1, 2]], pa.list_(pa.uint32(), 2))
+>>> values = pa.array([[1, 2, 3, 4, 5, 6], [7, 8]], pa.list_(pa.int32()))
+>>> arr = pa.StructArray.from_arrays([shapes, values], names=["shape", 
"data"])
+>>> pa.ExtensionArray.from_storage(tensor_type, arr)
+
+-- is_valid: all not null
+-- child 0 type: fixed_size_list[2]
+  [
+[
+  2,
+  3
+],
+[
+  1,
+  2
+]
+  ]
+-- child 1 type: list
+  [
+[
+  1,
+  2,
+  3,
+  4,
+  5,
+  6
+],
+[
+  7,
+  8
+]
+  ]
+"""
+
+def to_numpy_ndarray(self):

Review Comment:
   BTW this method also has a different return type than the 
`FixedShapeTensorArray.to_numpy_ndarray` we added, so that's another reason to 
use a different name (or just remove it)
   
   For FixedShapeTensorArray the idea was that we wanted to be able to return a 
single nD array, which is possible for fixed shape, but not for variable shape.






Re: [PR] GH-38007: [C++][Python] Add VariableShapeTensor implementation [arrow]

2023-12-04 Thread via GitHub


rok commented on code in PR #38008:
URL: https://github.com/apache/arrow/pull/38008#discussion_r1414521731


##
python/pyarrow/array.pxi:
##
@@ -3586,6 +3586,156 @@ class FixedShapeTensorArray(ExtensionArray):
 )
 
 
+cdef class VariableShapeTensorArray(ExtensionArray):
+"""
+Concrete class for variable shape tensor extension arrays.
+
+Examples
+
+Define the extension type for tensor array
+
+>>> import pyarrow as pa
+>>> tensor_type = pa.variable_shape_tensor(pa.int32(), 2)
+
+Create an extension array
+
+>>> shapes = pa.array([[2, 3], [1, 2]], pa.list_(pa.uint32(), 2))
+>>> values = pa.array([[1, 2, 3, 4, 5, 6], [7, 8]], pa.list_(pa.int32()))
+>>> arr = pa.StructArray.from_arrays([shapes, values], names=["shape", 
"data"])
+>>> pa.ExtensionArray.from_storage(tensor_type, arr)
+
+-- is_valid: all not null
+-- child 0 type: fixed_size_list[2]
+  [
+[
+  2,
+  3
+],
+[
+  1,
+  2
+]
+  ]
+-- child 1 type: list
+  [
+[
+  1,
+  2,
+  3,
+  4,
+  5,
+  6
+],
+[
+  7,
+  8
+]
+  ]
+"""
+
+def to_numpy_ndarray(self):

Review Comment:
   Or maybe it's better to remove this for now.






Re: [PR] feat(r): Provide LinkingTo headers for extension packages [arrow-nanoarrow]

2023-12-04 Thread via GitHub


eddelbuettel commented on PR #332:
URL: https://github.com/apache/arrow-nanoarrow/pull/332#issuecomment-1839499034

   We are making use of Arrow exports from C++ using just the `void*` pointer.  
What we have was written before / without `nanoarrow`, but I have started to 
bring it onto the R bindings side.  There are a number of things there I'd like 
to revisit / extend. From my looking around at some of the packages I see using 
`nanoarrow` (including `adbc*` and `duckdb`) I have the feeling that there are 
commonalities: those packages may benefit too. 
   
   I quite like `nanoarrow` but (if I may) also find it really frustrating at 
times because it essentially is three somewhat distinct and non-overlapping 
parts.  First, the vendorable header / c file pair which are great.  Then the R 
package which is nice for some things from R but which, frustratingly, does not 
(did not?) make any tools available for re-use in conjunction with the 
vendorable part.  And then there is the budding Python package which may one 
day cover what the R package does today, and also provide object creation help.
   
   Now this PR looks promising. I will look more closely at it and then 
release 0.4.0.  This clearly seems to move in the right direction, so three 
cheers for that!  If you want to brainstorm / chat for a few minutes, I am sure 
we can find some time around the two hours of timezone difference.





Re: [PR] GH-38007: [C++][Python] Add VariableShapeTensor implementation [arrow]

2023-12-04 Thread via GitHub


rok commented on code in PR #38008:
URL: https://github.com/apache/arrow/pull/38008#discussion_r1414513505


##
python/pyarrow/array.pxi:
##
@@ -3586,6 +3586,156 @@ class FixedShapeTensorArray(ExtensionArray):
 )
 
 
+cdef class VariableShapeTensorArray(ExtensionArray):
+"""
+Concrete class for variable shape tensor extension arrays.
+
+Examples
+
+Define the extension type for tensor array
+
+>>> import pyarrow as pa
+>>> tensor_type = pa.variable_shape_tensor(pa.int32(), 2)
+
+Create an extension array
+
+>>> shapes = pa.array([[2, 3], [1, 2]], pa.list_(pa.uint32(), 2))
+>>> values = pa.array([[1, 2, 3, 4, 5, 6], [7, 8]], pa.list_(pa.int32()))
+>>> arr = pa.StructArray.from_arrays([shapes, values], names=["shape", 
"data"])
+>>> pa.ExtensionArray.from_storage(tensor_type, arr)
+
+-- is_valid: all not null
+-- child 0 type: fixed_size_list[2]
+  [
+[
+  2,
+  3
+],
+[
+  1,
+  2
+]
+  ]
+-- child 1 type: list
+  [
+[
+  1,
+  2,
+  3,
+  4,
+  5,
+  6
+],
+[
+  7,
+  8
+]
+  ]
+"""
+
+def to_numpy_ndarray(self):

Review Comment:
   Oh, I assumed an array of arrays was not technically possible. Will change.






Re: [PR] GH-38998: [Java] Build memory-core and memory-unsafe as JPMS modules [arrow]

2023-12-04 Thread via GitHub


jduo commented on PR #39011:
URL: https://github.com/apache/arrow/pull/39011#issuecomment-1839493249

   It seems these changes are not usable without modularizing memory-core. 
arrow-vector makes use of Preconditions and BufferAllocator heavily. Since 
these are declared in an unnamed module (memory-core), the user has to add 
`--add-reads=org.apache.arrow.vector=ALL-UNNAMED`. This doesn't fail when 
running arrow-vector tests because Surefire adds this automatically for that 
module, but it fails any other test that depends on arrow-vector.
   
   The Preconditions class could be moved to a new utils module, but that 
doesn't deal with BufferAllocator which is very much part of memory-core out of 
necessity.





Re: [I] [C++] Static pkg-config file needs `-framework Security` on MacOS [arrow]

2023-12-04 Thread via GitHub


assignUser commented on issue #38861:
URL: https://github.com/apache/arrow/issues/38861#issuecomment-1839486823

   @jeroen the fix will be in 14.0.2 :)





Re: [PR] GH-38997: [Java] Modularize format and vector and add plugin for compiling module-info files with JDK8 builds [arrow]

2023-12-04 Thread via GitHub


lidavidm commented on PR #38995:
URL: https://github.com/apache/arrow/pull/38995#issuecomment-1839475501

   This is on my backlog - I'm hoping to give this a review this week





Re: [I] Consider merging static libarrow builds into universal binaries [arrow]

2023-12-04 Thread via GitHub


assignUser commented on issue #39068:
URL: https://github.com/apache/arrow/issues/39068#issuecomment-1839469637

   Yeah, makes sense; we do something similar with Windows and ucrt/non-ucrt 
after all. IIRC we use universal binaries for the pyarrow wheels as well.





[PR] feat(r): Provide LinkingTo headers for extension packages [arrow-nanoarrow]

2023-12-04 Thread via GitHub


paleolimbot opened a new pull request, #332:
URL: https://github.com/apache/arrow-nanoarrow/pull/332

   In developing geoarrow, an R extension that imports/exports a number of 
Arrow C data structures wrapped by nanoarrow S3 objects, it has become apparent 
that the sanitize and allocate operations are non-trivial and basically have to 
be copied in every package that wants to import/export nanoarrow things. 
@eddelbuettel has run up against some valid use-cases as well! 
https://github.com/eddelbuettel/linesplitter/issues/1 , 
https://github.com/apache/arrow-nanoarrow/issues/187
   
   This PR makes the definition of how Arrow C Data interface objects are 
encoded as R external pointers available as a public header (such that 
downstream packages can `LinkingTo: nanoarrow` and `#include `. 
I think the initial target will just be allocating an owning external pointer and 
sanitizing an input SEXP.
   
   @eddelbuettel Are there any other operations that are blocking any of your 
projects that would be must-haves in this header?
   
   (I know it's missing array_stream...I forgot about the ability to supply R 
finalizers and so it's a slightly more complicated change)





Re: [PR] GH-37359: [C#] Add ToList() to Decimal128Array and Decimal256Array [arrow]

2023-12-04 Thread via GitHub


conbench-apache-arrow[bot] commented on PR #37383:
URL: https://github.com/apache/arrow/pull/37383#issuecomment-1839459596

   After merging your PR, Conbench analyzed the 4 benchmarking runs that have 
been run so far on merge-commit d357d2d8e2a7210a7e9c1833dfe11821f3f0d349.
   
   There were no benchmark performance regressions. 🎉
   
   The [full Conbench report](https://github.com/apache/arrow/runs/19303699402) 
has more details.





Re: [PR] enable field-order-agnostic overloads of `fromarrow` for struct types [arrow-julia]

2023-12-04 Thread via GitHub


jrevels commented on PR #493:
URL: https://github.com/apache/arrow-julia/pull/493#issuecomment-1839458969

   Did some benchmarking, looks like there is a small perf hit here to the 
general case that needs to be mitigated before this can be merged:
   
   ```jl
   using Arrow, ArrowTypes, Random, BenchmarkTools
   
   struct Foo
   a::Int
   b::String
   c::Vector{String}
   d::Float64
   end
   
   ArrowTypes.arrowname(::Type{Foo}) = Symbol("JuliaLang.Foo")
   ArrowTypes.JuliaType(::Val{Symbol("JuliaLang.Foo")}, T) = Foo
   
   genfoo() = Foo(rand(1:10), randstring(10), [randstring(10) for _ in 
1:rand(2:5)], rand())
   t = (; f = [genfoo() for _ in 1:1000])
   f = Arrow.Table(Arrow.tobuffer(t)).f
   @benchmark sum(x -> x.d, $f)
   ```
   
   results on `jrevels:jr/structelement`
   
   ```
   BenchmarkTools.Trial: 6891 samples with 1 evaluation.
 Range (min … max):  631.291 μs …   2.410 ms  ┊ GC (min … max):  0.00% … 70.30%
 Time  (median):     646.250 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   724.383 μs ± 299.599 μs  ┊ GC (mean ± σ):  10.04% ± 15.20%

 █▅▃▁  ▂▂  ▁
 ▆▆▅▄▁▁▁▃▁▃███ █
 631 μs        Histogram: log(frequency) by time        1.95 ms <
   
Memory estimate: 2.53 MiB, allocs estimate: 22470.
   ```
   
   results on `main`
   
   ```
   BenchmarkTools.Trial: 8118 samples with 1 evaluation.
 Range (min … max):  531.667 μs …   2.338 ms  ┊ GC (min … max): 0.00% … 73.71%
 Time  (median):     550.333 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   614.904 μs ± 266.473 μs  ┊ GC (mean ± σ):  9.82% ± 14.78%

 █▇▄▂  ▁▂▁ ▁
 ▇▆▄▃▁▁███ █
 532 μs        Histogram: log(frequency) by time        1.78 ms <
   
Memory estimate: 2.23 MiB, allocs estimate: 18482.
   ```
   





Re: [I] [C++] Implement ODBC driver "wrapper" using FlightSQL [arrow]

2023-12-04 Thread via GitHub


lidavidm commented on issue #30622:
URL: https://github.com/apache/arrow/issues/30622#issuecomment-1839415183

   @alinaliBQ how about `cpp/src/flightsql_odbc` or similar? 





Re: [PR] GH-38007: [C++][Python] Add VariableShapeTensor implementation [arrow]

2023-12-04 Thread via GitHub


jorisvandenbossche commented on code in PR #38008:
URL: https://github.com/apache/arrow/pull/38008#discussion_r1414433628


##
python/pyarrow/array.pxi:
##
@@ -3586,6 +3586,156 @@ class FixedShapeTensorArray(ExtensionArray):
 )
 
 
+cdef class VariableShapeTensorArray(ExtensionArray):
+"""
+Concrete class for variable shape tensor extension arrays.
+
+Examples
+
+Define the extension type for tensor array
+
+>>> import pyarrow as pa
+>>> tensor_type = pa.variable_shape_tensor(pa.int32(), 2)
+
+Create an extension array
+
+>>> shapes = pa.array([[2, 3], [1, 2]], pa.list_(pa.uint32(), 2))
+>>> values = pa.array([[1, 2, 3, 4, 5, 6], [7, 8]], pa.list_(pa.int32()))
+>>> arr = pa.StructArray.from_arrays([shapes, values], names=["shape", 
"data"])
+>>> pa.ExtensionArray.from_storage(tensor_type, arr)
+
+-- is_valid: all not null
+-- child 0 type: fixed_size_list[2]
+  [
+[
+  2,
+  3
+],
+[
+  1,
+  2
+]
+  ]
+-- child 1 type: list
+  [
+[
+  1,
+  2,
+  3,
+  4,
+  5,
+  6
+],
+[
+  7,
+  8
+]
+  ]
+"""
+
+def to_numpy_ndarray(self):

Review Comment:
   Is this actually needed as a method to start with? 
   
   My expectation is that `to_numpy()` should ideally return this, but just as 
an object-dtype numpy array of numpy arrays, instead of a list of numpy arrays. 
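
To make the constraint concrete: each element of a variable-shape tensor array reshapes its own flat values by its own shape, so there is no common shape to stack into one ndarray. A pure-Python sketch, using the shapes/values from the docstring example above (nested lists stand in for NumPy arrays, and `unflatten` is a helper name of my own invention):

```python
from functools import reduce
from operator import mul

def unflatten(flat, shape):
    """Reshape a flat sequence into nested lists with the given shape."""
    if len(shape) == 1:
        assert len(flat) == shape[0]
        return list(flat)
    step = reduce(mul, shape[1:], 1)
    return [unflatten(flat[i * step:(i + 1) * step], shape[1:])
            for i in range(shape[0])]

# The storage from the example: per-element shapes plus flattened values.
shapes = [[2, 3], [1, 2]]
values = [[1, 2, 3, 4, 5, 6], [7, 8]]
tensors = [unflatten(v, s) for v, s in zip(values, shapes)]
# tensors == [[[1, 2, 3], [4, 5, 6]], [[7, 8]]] -- every element has its
# own shape, so the natural container is a sequence (or object-dtype
# array) of tensors, never a single (n+1)-dimensional array.
```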






[PR] Add Projection to FilterExec [arrow-datafusion]

2023-12-04 Thread via GitHub


Dandandan opened a new pull request, #8416:
URL: https://github.com/apache/arrow-datafusion/pull/8416

   ## Which issue does this PR close?
   
   
   
   Closes #.
   
   ## Rationale for this change
   
   
   
   ## What changes are included in this PR?
   
   
   
   ## Are these changes tested?
   
   
   
   ## Are there any user-facing changes?
   
   
   
   
   





Re: [PR] GH-15058: [C++][Python] Native support for UUID [arrow]

2023-12-04 Thread via GitHub


arogozhnikov commented on PR #37298:
URL: https://github.com/apache/arrow/pull/37298#issuecomment-1839384913

   Hi @rok, can you please summarize where you're stuck with UUIDs? I see some 
installation failures





Re: [PR] feat(extensions/nanoarrow_ipc): Add integration testing utility executable [arrow-nanoarrow]

2023-12-04 Thread via GitHub


codecov-commenter commented on PR #330:
URL: https://github.com/apache/arrow-nanoarrow/pull/330#issuecomment-1839382020

   ## 
[Codecov](https://app.codecov.io/gh/apache/arrow-nanoarrow/pull/330?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 Report
   All modified and coverable lines are covered by tests :white_check_mark:
   > Comparison is base 
[(`4744498`)](https://app.codecov.io/gh/apache/arrow-nanoarrow/commit/4744498daf166bc3ae8d34d157a2598002447d4e?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 87.83% compared to head 
[(`ecbeb37`)](https://app.codecov.io/gh/apache/arrow-nanoarrow/pull/330?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 89.10%.
   > Report is 3 commits behind head on main.
   
   
   Additional details and impacted files
   
   
   ```diff
   @@            Coverage Diff             @@
   ##             main     #330       +/-   ##
   ===========================================
   + Coverage   87.83%   89.10%    +1.27%
   ===========================================
     Files          72        4       -68
     Lines       11114      101    -11013
   ===========================================
   - Hits         9762       90     -9672
   + Misses       1352       11     -1341
   ```
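As a quick sanity check on the figures above (plain arithmetic, not codecov tooling; the base line count is inferred as head plus the delta):

```python
# Base and delta columns from the coverage table; head = base + delta.
base = {"coverage": 87.83, "files": 72, "lines": 11114, "hits": 9762, "misses": 1352}
delta = {"coverage": 1.27, "files": -68, "lines": -11013, "hits": -9672, "misses": -1341}
head = {k: round(base[k] + delta[k], 2) for k in base}
print(head)  # coverage 89.1, files 4, lines 101, hits 90, misses 11
```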
   
   
   
   
   
   [:umbrella: View full report in Codecov by 
Sentry](https://app.codecov.io/gh/apache/arrow-nanoarrow/pull/330?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   
   :loudspeaker: Have feedback on the report? [Share it 
here](https://about.codecov.io/codecov-pr-comment-feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   




