Re: [I] SQL: Ambiguous reference when aliasing in combination with ORDER BY [arrow-datafusion]
Asura7969 commented on issue #8391: URL: https://github.com/apache/arrow-datafusion/issues/8391#issuecomment-1840186240 Do you have any plans to fix this bug? If you don't have time, I'll follow up. @Jesse-Bakker -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] GH-38923: [GLib] Fix spelling [arrow]
kou commented on code in PR #38924: URL: https://github.com/apache/arrow/pull/38924#discussion_r1415033694

## c_glib/test/test-array.rb:

```diff
@@ -141,12 +141,12 @@ def test_no_diff
   def test_diff
     array = build_string_array(["Start", "Shutdown", "Reboot"])
-    other_array = build_string_array(["Start", "Shutdonw", "Reboot"])
+    other_array = build_string_array(["Start", "Shutdown_", "Reboot"])
     assert_equal(<<-STRING.chomp, array.diff_unified(other_array))
       @@ -1, +1 @@
        -"Shutdown"
-       +"Shutdonw"
+       +"Shutdown_"
```

Review Comment: I took another look at this test. We should not change this, because it is a test for detecting the difference between two arrays: the "Shutdonw" typo is there on purpose, to create that difference. Or we could use an entirely different word, something like:

```diff
diff --git a/c_glib/test/test-array.rb b/c_glib/test/test-array.rb
index c03aecf17..4da641b20 100644
--- a/c_glib/test/test-array.rb
+++ b/c_glib/test/test-array.rb
@@ -141,12 +141,12 @@ class TestArray < Test::Unit::TestCase
   def test_diff
     array = build_string_array(["Start", "Shutdown", "Reboot"])
-    other_array = build_string_array(["Start", "Shutdonw", "Reboot"])
+    other_array = build_string_array(["Start", "Running", "Reboot"])
     assert_equal(<<-STRING.chomp, array.diff_unified(other_array))
       @@ -1, +1 @@
        -"Shutdown"
-       +"Shutdonw"
+       +"Running"
     STRING
   end
```
Re: [I] [C++][FS][Azure] Should `GetFileInfo()` against a directory always return true without hierarchical namespace support? [arrow]
Tom-Newton commented on issue #38772: URL: https://github.com/apache/arrow/issues/38772#issuecomment-1840166133 > @felipecrv @Tom-Newton Do either of you want to work on this? I might be able to do it in about a week
Re: [I] [Java][FlightSQL] In authentication, how does flight server get the IP/Hostname of ADBC client [arrow]
xinyiZzz commented on issue #38911: URL: https://github.com/apache/arrow/issues/38911#issuecomment-1840114848 I'm hoping for some help, thanks~
Re: [I] cretae s3 fialed [arrow-datafusion]
Jefffrey commented on issue #8421: URL: https://github.com/apache/arrow-datafusion/issues/8421#issuecomment-1840107703 Looks to be a regression introduced by https://github.com/apache/arrow-datafusion/pull/7390 Specifically: https://github.com/apache/arrow-datafusion/blob/d1554c85c0da2342c5247ed22a728617e8f69142/datafusion/common/src/file_options/parquet_writer.rs#L208 We can probably relax this strict check to allow unknown keys, since they could be used for other purposes, such as the S3 object store. The same could potentially be done for JSON and CSV too: https://github.com/apache/arrow-datafusion/blob/d1554c85c0da2342c5247ed22a728617e8f69142/datafusion/common/src/file_options/json_writer.rs#L54 https://github.com/apache/arrow-datafusion/blob/d1554c85c0da2342c5247ed22a728617e8f69142/datafusion/common/src/file_options/csv_writer.rs#L100
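The relaxation Jefffrey describes could be sketched roughly as follows. This is illustrative only, not DataFusion's actual `parquet_writer.rs` code: instead of erroring on any unrecognized key, the option parser skips keys it does not own and passes them through for other consumers (such as the object-store layer). The names `apply_parquet_options` and `known` are assumptions for the sketch.

```rust
use std::collections::HashMap;

/// Illustrative sketch: apply only the writer options we recognize and
/// forward unknown keys instead of failing, so that keys meant for other
/// layers (e.g. S3 credentials) do not abort table creation.
fn apply_parquet_options(
    options: &HashMap<String, String>,
) -> Result<HashMap<String, String>, String> {
    // Hypothetical set of options this writer understands.
    let known = ["compression", "max_row_group_size"];
    let mut unknown = HashMap::new();
    for (key, value) in options {
        if known.contains(&key.as_str()) {
            // A real implementation would set the writer property here.
        } else {
            // Previously this branch returned an error such as
            // "Found unsupported option {key}"; relaxing the check means
            // forwarding the key instead of rejecting it.
            unknown.insert(key.clone(), value.clone());
        }
    }
    Ok(unknown)
}

fn main() {
    let mut opts = HashMap::new();
    opts.insert("compression".to_string(), "zstd".to_string());
    opts.insert("access_key_id".to_string(), "xyz".to_string());
    let unknown = apply_parquet_options(&opts).unwrap();
    // "access_key_id" is forwarded rather than rejected.
    println!("{}", unknown.contains_key("access_key_id")); // true
}
```

Whether the unknown keys should be collected and handed to the object-store registration path, or simply ignored, would be a design decision for the actual fix.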
Re: [I] Failed to build docker image [arrow-ballista]
monkeymq commented on issue #912: URL: https://github.com/apache/arrow-ballista/issues/912#issuecomment-1840091400 @zhangyuan Thank you for your suggestion; it solved part of my problem. I say this because I also needed to set permissions on some scripts before performing any steps (after `git clone`, of course), i.e. `chmod a+rwx ./dev/docker/*.sh`; this works for me. BTW, the documentation for Ballista is a bit old and incorrect, which puts some people off.
Re: [PR] GH-39050: [C++] Use Cast() instead of CastTo() for Timestamp Scalar in test [arrow]
conbench-apache-arrow[bot] commented on PR #39060: URL: https://github.com/apache/arrow/pull/39060#issuecomment-1840053073 After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit c39a2230b931a009c34e2a4885b453302c02b233. There were no benchmark performance regressions. 🎉 The [full Conbench report](https://github.com/apache/arrow/runs/19315635400) has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.
Re: [I] String compare for Substrait roundtrip check [arrow-datafusion]
tgujar commented on issue #8361: URL: https://github.com/apache/arrow-datafusion/issues/8361#issuecomment-1840047758 Plan 1 is the produced logical plan, and Plan 2 is Plan 1 converted to Substrait and then converted back to a logical plan. I think the issue arises because of differences between what can be expressed by the Substrait grammar and the logical plan generated by DataFusion. For tests where the plans differ, I think using [assert_expected_plan](https://github.com/apache/arrow-datafusion/blob/aeb012e78115bf69d9a58407ac24bc20bd9e0bf0/datafusion/substrait/tests/cases/roundtrip_logical_plan.rs#L781C10-L781C30) instead of [roundtrip_with_ctx](https://github.com/apache/arrow-datafusion/blob/aeb012e78115bf69d9a58407ac24bc20bd9e0bf0/datafusion/substrait/tests/cases/roundtrip_logical_plan.rs#L832) should work after manual inspection. I want to look into this further, but maybe we could normalize both plans somehow and then check for equality.
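The normalization idea could be sketched as follows. This is purely illustrative, not DataFusion code: render both plans to text and normalize cosmetic differences (trailing whitespace, blank lines) before comparing, so that plans that differ only in formatting compare equal. A real version would more likely normalize structural details (aliases, column ordering) on the plan tree itself; the function names here are assumptions.

```rust
/// Illustrative: normalize a plan's textual rendering so that two
/// displays of equivalent plans can be compared for equality.
fn normalize_plan_text(plan: &str) -> String {
    plan.lines()
        .map(str::trim_end)         // drop trailing whitespace per line
        .filter(|l| !l.is_empty())  // drop blank lines
        .collect::<Vec<_>>()
        .join("\n")
}

fn plans_equal(plan1: &str, plan2: &str) -> bool {
    normalize_plan_text(plan1) == normalize_plan_text(plan2)
}

fn main() {
    // The two renderings differ only cosmetically.
    let p1 = "Projection: a\n  TableScan: t\n";
    let p2 = "Projection: a   \n\n  TableScan: t";
    println!("{}", plans_equal(p1, p2)); // true
}
```

String-level normalization like this would only paper over formatting drift; differences introduced by the Substrait grammar itself would still need `assert_expected_plan`-style manual inspection.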
Re: [PR] GH-38705: [C++][FS][Azure] Implement CopyFile() [arrow]
kou commented on PR #39058: URL: https://github.com/apache/arrow/pull/39058#issuecomment-1840039735 Updated: * Accept only blob path for destination like other implementations.
Re: [PR] Bump actions/labeler from 4.3.0 to 5.0.0 [arrow-ballista]
dependabot[bot] commented on PR #923: URL: https://github.com/apache/arrow-ballista/pull/923#issuecomment-1840015487 The following labels could not be found: `auto-dependencies`.
[PR] Bump actions/labeler from 4.3.0 to 5.0.0 [arrow-ballista]
dependabot[bot] opened a new pull request, #923: URL: https://github.com/apache/arrow-ballista/pull/923

Bumps [actions/labeler](https://github.com/actions/labeler) from 4.3.0 to 5.0.0.

Release notes (sourced from actions/labeler's releases: https://github.com/actions/labeler/releases):

v5.0.0: What's Changed

This release contains the following breaking changes:

- The ability to apply labels based on the names of base and/or head branches was added (actions/labeler#186 and actions/labeler#54).
- The match object for changed files was expanded with new combinations in order to make it more intuitive and flexible (actions/labeler#423 and actions/labeler#101). As a result, the configuration file structure was significantly redesigned and is not compatible with the structure of the previous version. Please read the action documentation (https://github.com/actions/labeler/tree/main#pull-request-labeler) to find out how to adapt your configuration files for use with the new action version.
- The bug related to the `sync-labels` input was fixed (actions/labeler#112). Now the input value is read correctly.
- By default, the `dot` input is set to `true`. Now, paths starting with a dot (e.g. `.github`) are matched by default.
- Version 5 of this action updated the runtime to Node.js 20 (https://docs.github.com/en/actions/creating-actions/metadata-syntax-for-github-actions#runs-for-javascript-actions). All scripts are now run with Node.js 20 instead of Node.js 16 and are affected by any breaking changes between Node.js 16 and 20.

For more information, please read the action documentation: https://github.com/actions/labeler/tree/main#pull-request-labeler

New Contributors:

- @joshdales made their first contribution in actions/labeler#203
- @dusan-trickovic made their first contribution in actions/labeler#626
- @sungh0lim made their first contribution in actions/labeler#630
- @TrianguloY made their first contribution in actions/labeler#629

Full Changelog: https://github.com/actions/labeler/compare/v4...v5.0.0

v5.0.0-beta.1: What's Changed

In scope of this beta release, the structure of the configuration file (`.github/labeler.yml`) was changed from

```yaml
LabelName:
- any:
  - changed-files: ['list', 'of', 'globs']
  - base-branch: ['list', 'of', 'regexps']
  - head-branch: ['list', 'of', 'regexps']
- all:
  - changed-files: ['list', 'of', 'globs']
  - base-branch: ['list', 'of', 'regexps']
  - head-branch: ['list', 'of', 'regexps']
```

to

```yaml
LabelName:
- any:
  - changed-files:
    - AnyGlobToAnyFile: ['list', 'of', 'globs']
    - AnyGlobToAllFiles: ['list', 'of', 'globs']
    - AllGlobsToAnyFile: ['list', 'of', 'globs']
    - AllGlobsToAllFiles: ['list', 'of', 'globs']
  - base-branch: ['list', 'of', 'regexps']
  - head-branch: ['list', 'of', 'regexps']
- all:
  - changed-files:
    - AnyGlobToAnyFile: ['list', 'of', 'globs']
    - AnyGlobToAllFiles: ['list', 'of', 'globs']
    - AllGlobsToAnyFile: ['list', 'of', 'globs']
    ...
```

(truncated)

Commits:

- 8558fd7 Merge pull request actions/labeler#709 from actions/v5.0.0-beta
- 000ca75 Merge pull request actions/labeler#700 from MaksimZhukov/apply-suggestions-and-update-docume...
- cb66c2f Update dist
- 9181355 Apply suggestions for the beta vesrion and update the documentation
- efe4c1c Merge pull request actions/labeler#699 from MaksimZhukov/update-node-runtime-and-dependencies
- c0957ad Run Prettier
- 8dc8d18 Update Node.js version i
Re: [PR] test(csharp/test/Drivers/Snowflake): refactor snowflake test framework [arrow-adbc]
ruowan commented on PR #1323: URL: https://github.com/apache/arrow-adbc/pull/1323#issuecomment-1840012998 Resolved. I think we can merge this PR.
Re: [PR] Remove macro in iter_to_array for List [arrow-datafusion]
viirya commented on code in PR #8414: URL: https://github.com/apache/arrow-datafusion/pull/8414#discussion_r1414844779 ## datafusion/common/src/scalar.rs: ## @@ -3616,6 +3216,19 @@ mod tests { Some(vec![Some(4), Some(5)]), ]); assert_eq!(list_array, &expected); + +let scalars = build_list::(vec![ +vec![Some(1), Some(2), Some(3)], +vec![Some(4), Some(5)], +]); Review Comment: Maybe we can add tests that contain null?
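The shape of a null-containing case like the one viirya asks for can be illustrated outside DataFusion. This sketch uses plain `Option` values as stand-ins for Arrow list scalars (`None` models a NULL list, and inner `None` models a NULL element); `ListValue` and `count_non_null_elements` are illustrative names, not DataFusion API.

```rust
/// Illustrative stand-in for a list scalar: None is a NULL list,
/// Some(vec) is a list whose elements may themselves be NULL.
type ListValue = Option<Vec<Option<i64>>>;

fn count_non_null_elements(lists: &[ListValue]) -> usize {
    lists
        .iter()
        .flatten()               // skip NULL lists
        .flatten()               // iterate the elements of each list
        .filter(|e| e.is_some()) // skip NULL elements
        .count()
}

fn main() {
    let lists: Vec<ListValue> = vec![
        Some(vec![Some(1), None, Some(3)]), // list containing a NULL element
        None,                               // NULL list
        Some(vec![Some(4), Some(5)]),
    ];
    println!("{}", count_non_null_elements(&lists)); // 4
}
```

The point of such a test in the PR would be to exercise both levels of nullability that the removed macros previously handled (`list_arr.is_null(0)` and null values inside the list).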
Re: [PR] Remove macro in iter_to_array for List [arrow-datafusion]
viirya commented on code in PR #8414: URL: https://github.com/apache/arrow-datafusion/pull/8414#discussion_r1414844561 ## datafusion/common/src/scalar.rs: ## @@ -1368,103 +1366,26 @@ impl ScalarValue { }}; } -macro_rules! build_array_list_primitive { -($ARRAY_TY:ident, $SCALAR_TY:ident, $NATIVE_TYPE:ident, $LIST_TY:ident, $SCALAR_LIST:pat) => {{ -Ok::(Arc::new($LIST_TY::from_iter_primitive::<$ARRAY_TY, _, _>( -scalars.into_iter().map(|x| match x{ -ScalarValue::List(arr) if matches!(x, $SCALAR_LIST) => { -// `ScalarValue::List` contains a single element `ListArray`. -let list_arr = as_list_array(&arr); -if list_arr.is_null(0) { -Ok(None) -} else { -let primitive_arr = - list_arr.values().as_primitive::<$ARRAY_TY>(); -Ok(Some( - primitive_arr.into_iter().collect::>>(), -)) -} -} -ScalarValue::LargeList(arr) if matches!(x, $SCALAR_LIST) =>{ -// `ScalarValue::List` contains a single element `ListArray`. -let list_arr = as_large_list_array(&arr); -if list_arr.is_null(0) { -Ok(None) -} else { -let primitive_arr = - list_arr.values().as_primitive::<$ARRAY_TY>(); -Ok(Some( - primitive_arr.into_iter().collect::>>(), -)) -} -} -sv => _internal_err!( -"Inconsistent types in ScalarValue::iter_to_array. \ -Expected {:?}, got {:?}", -data_type, sv -), -}) -.collect::>>()?, -))) -}}; -} - -macro_rules! build_array_list_string { -($BUILDER:ident, $STRING_ARRAY:ident,$LIST_BUILDER:ident,$SCALAR_LIST:pat) => {{ -let mut builder = $LIST_BUILDER::new($BUILDER::new()); -for scalar in scalars.into_iter() { -match scalar { -ScalarValue::List(arr) if matches!(scalar, $SCALAR_LIST) => { -// `ScalarValue::List` contains a single element `ListArray`. 
-let list_arr = as_list_array(&arr); - -if list_arr.is_null(0) { -builder.append(false); -continue; -} - -let string_arr = $STRING_ARRAY(list_arr.values()); - -for v in string_arr.iter() { -if let Some(v) = v { -builder.values().append_value(v); -} else { -builder.values().append_null(); -} -} -builder.append(true); -} -ScalarValue::LargeList(arr) if matches!(scalar, $SCALAR_LIST) => { -// `ScalarValue::List` contains a single element `ListArray`. -let list_arr = as_large_list_array(&arr); - -if list_arr.is_null(0) { -builder.append(false); -continue; -} - -let string_arr = $STRING_ARRAY(list_arr.values()); - -for v in string_arr.iter() { -if let Some(v) = v { -builder.values().append_value(v); -} else { -builder.values().append_null(); -} -} -builder.append(true); -} -sv => { -return _internal_err!( -"Inconsistent types in ScalarValue::iter_to_array. \ -Expected List, got {:?}", -sv -) -} -} -} -A
Re: [I] [C++][Gandiva] Migration JIT engine from MCJIT to LLJIT [arrow]
niyue commented on issue #37848: URL: https://github.com/apache/arrow/issues/37848#issuecomment-1839951586 Here is the discussion thread: https://lists.apache.org/thread/fphzvtr1jrc069z7kv78oopgr4zrjfgl
Re: [PR] MINOR: [CI] Bump actions/setup-java from 3 to 4 [arrow]
conbench-apache-arrow[bot] commented on PR #39063: URL: https://github.com/apache/arrow/pull/39063#issuecomment-1839944650 After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 6958f218528c9cd04ce63152bffb8ff86f42bd15. There were no benchmark performance regressions. 🎉 The [full Conbench report](https://github.com/apache/arrow/runs/19313381468) has more details.
[I] cretae s3 fialed [arrow-datafusion]
web3creator opened a new issue, #8421: URL: https://github.com/apache/arrow-datafusion/issues/8421

### Describe the bug

```
datafusion-cli
❯ CREATE EXTERNAL TABLE test STORED AS PARQUET OPTIONS( 'region' 'xxx', 'access_key_id' '', 'secret_access_key' '' ) LOCATION 'x';
Invalid or Unsupported Configuration: Found unsupported option access_key_id with value  for Parquet format!
❯
```

### To Reproduce

Run the `CREATE EXTERNAL TABLE` statement shown above in `datafusion-cli`.

### Expected behavior

_No response_

### Additional context

_No response_
Re: [PR] feat: ScalarValue from String [arrow-datafusion]
QuenKar commented on code in PR #8411: URL: https://github.com/apache/arrow-datafusion/pull/8411#discussion_r1414788401

## datafusion/common/src/scalar.rs:

```diff
@@ -4645,6 +4651,16 @@ mod tests {
     );
 }

+#[test]
+fn test_scalar_value_from_string() {
+    let scalar = ScalarValue::from("foo");
+    assert_eq!(scalar, ScalarValue::Utf8(Some("foo".to_string())));
+    let scalar = ScalarValue::from("foo".to_string());
+    assert_eq!(scalar, ScalarValue::Utf8(Some("foo".to_string())));
```

Review Comment: I found that there is no code similar to `"foo".to_string().as_str()`, `some_string.as_str()`, `&some_string`, or `as_str().into()` in this usage (but maybe I didn't find all of them...).
Re: [PR] GH-39050: [C++] Use Cast() instead of CastTo() for Timestamp Scalar in test [arrow]
kou merged PR #39060: URL: https://github.com/apache/arrow/pull/39060
Re: [PR] GH-38597: [C++] Implement GetFileInfo(selector) for Azure filesystem [arrow]
kou commented on code in PR #39009: URL: https://github.com/apache/arrow/pull/39009#discussion_r1413052434 ## cpp/src/arrow/filesystem/azurefs.cc: ## @@ -815,6 +815,233 @@ class AzureFileSystem::Impl { } } + private: + template + Status ListContainers(const Azure::Core::Context& context, +OnContainer&& on_container) const { +Azure::Storage::Blobs::ListBlobContainersOptions options; +// Deleted containers are not returned. +options.Include = Azure::Storage::Blobs::Models::ListBlobContainersIncludeFlags::None; +try { + auto container_list_response = + blob_service_client_->ListBlobContainers(options, context); + for (; container_list_response.HasPage(); + container_list_response.MoveToNextPage(context)) { +for (const auto& container : container_list_response.BlobContainers) { + RETURN_NOT_OK(on_container(container)); +} + } +} catch (const Azure::Storage::StorageException& exception) { + return internal::ExceptionToStatus("Failed to list account containers.", exception); +} +return Status::OK(); + } + + static FileInfo FileInfoFromBlob(const std::string& container, + const Azure::Storage::Blobs::Models::BlobItem& blob) { +if (blob.Name.back() == internal::kSep) { + return DirectoryFileInfoFromPath(container + internal::kSep + blob.Name); +} +std::string path; +path.reserve(container.size() + 1 + blob.Name.size()); +path += container; +path += internal::kSep; +path += blob.Name; +FileInfo info{std::move(path), FileType::File}; +info.set_size(blob.BlobSize); + info.set_mtime(std::chrono::system_clock::time_point{blob.Details.LastModified}); +return info; + } + + static FileInfo DirectoryFileInfoFromPath(const std::string& path) { +return FileInfo{std::string{internal::RemoveTrailingSlash(path)}, +FileType::Directory}; + } + + static std::string_view BasenameView(std::string_view s) { +auto offset = s.find_last_of(internal::kSep); +auto tail = (offset == std::string_view::npos) ? 
s : s.substr(offset); +return internal::RemoveTrailingSlash(tail, /*preserve_root=*/false); + } + + /// \brief List the blobs at the root of a container or some dir in a container. + /// + /// \pre container_client is the client for the container named like the first + /// segment of select.base_dir. + Status GetFileInfoWithSelectorFromContainer( + const Azure::Storage::Blobs::BlobContainerClient& container_client, + const Azure::Core::Context& context, Azure::Nullable page_size_hint, + const FileSelector& select, FileInfoVector* acc_results) { +ARROW_ASSIGN_OR_RAISE(auto base_location, AzureLocation::FromString(select.base_dir)); + +bool found = false; +Azure::Storage::Blobs::ListBlobsOptions options; +if (internal::GetAbstractPathDepth(base_location.path) == 0) { + // If the base_dir is the root of the container, then we want to list all blobs in + // the container and the Prefix should be empty and not even include the trailing + // slash because the container itself represents the `/` directory. + options.Prefix = {}; + found = true; // Unless the container itself is not found later! +} else { + options.Prefix = internal::EnsureTrailingSlash(base_location.path); +} +options.PageSizeHint = page_size_hint; +options.Include = Azure::Storage::Blobs::Models::ListBlobsIncludeFlags::Metadata; + +// When Prefix.Value() contains a trailing slash and we find a blob that +// matches it completely, it is an empty directory marker blob for the +// directory we're listing from, and we should skip it. 
+auto is_empty_dir_marker = +[&options](const Azure::Storage::Blobs::Models::BlobItem& blob) noexcept -> bool { + return options.Prefix.HasValue() && blob.Name == options.Prefix.Value(); +}; + +auto recurse = [&](const std::string& blob_prefix) noexcept -> Status { + if (select.recursive && select.max_recursion > 0) { +FileSelector sub_select; +sub_select.base_dir = base_location.container; +sub_select.base_dir += internal::kSep; +sub_select.base_dir += internal::RemoveTrailingSlash(blob_prefix); +sub_select.allow_not_found = true; +sub_select.recursive = true; +sub_select.max_recursion = select.max_recursion - 1; +return GetFileInfoWithSelectorFromContainer( +container_client, context, page_size_hint, sub_select, acc_results); + } + return Status::OK(); +}; + +// (*acc_results)[*last_dir_reported] is the last FileType::Directory in the results +// produced through this loop over the response pages. +std::optional last_dir_reported{}; +auto matches_last_dir_reported = [&last_dir_reported, + acc_results]
Re: [PR] GH-37884: [Swift] allow reading of unaligned FlatBuffers buffers [arrow]
kou commented on code in PR #38635: URL: https://github.com/apache/arrow/pull/38635#discussion_r1414754609

## swift/Arrow/Package.swift:

```diff
@@ -32,7 +32,7 @@ let package = Package(
         targets: ["Arrow"]),
     ],
     dependencies: [
-        .package(url: "https://github.com/google/flatbuffers.git", from: "23.3.3")
+        .package(url: "https://github.com/google/flatbuffers.git", branch: "master")
```

Review Comment: Could you add a comment explaining why we need to use `master` for now and when we can specify a tag again?
Re: [PR] GH-38597: [C++] Implement GetFileInfo(selector) for Azure filesystem [arrow]
felipecrv commented on code in PR #39009: URL: https://github.com/apache/arrow/pull/39009#discussion_r1414751422 ## cpp/src/arrow/filesystem/azurefs.cc: ## @@ -815,6 +815,233 @@ class AzureFileSystem::Impl { } } + private: + template + Status ListContainers(const Azure::Core::Context& context, +OnContainer&& on_container) const { +Azure::Storage::Blobs::ListBlobContainersOptions options; +// Deleted containers are not returned. +options.Include = Azure::Storage::Blobs::Models::ListBlobContainersIncludeFlags::None; +try { + auto container_list_response = + blob_service_client_->ListBlobContainers(options, context); + for (; container_list_response.HasPage(); + container_list_response.MoveToNextPage(context)) { +for (const auto& container : container_list_response.BlobContainers) { + RETURN_NOT_OK(on_container(container)); +} + } +} catch (const Azure::Storage::StorageException& exception) { + return internal::ExceptionToStatus("Failed to list account containers.", exception); +} +return Status::OK(); + } + + static FileInfo FileInfoFromBlob(const std::string& container, + const Azure::Storage::Blobs::Models::BlobItem& blob) { +if (blob.Name.back() == internal::kSep) { + return DirectoryFileInfoFromPath(container + internal::kSep + blob.Name); +} +std::string path; +path.reserve(container.size() + 1 + blob.Name.size()); +path += container; +path += internal::kSep; +path += blob.Name; +FileInfo info{std::move(path), FileType::File}; +info.set_size(blob.BlobSize); + info.set_mtime(std::chrono::system_clock::time_point{blob.Details.LastModified}); +return info; + } + + static FileInfo DirectoryFileInfoFromPath(const std::string& path) { +return FileInfo{std::string{internal::RemoveTrailingSlash(path)}, +FileType::Directory}; + } + + static std::string_view BasenameView(std::string_view s) { +auto offset = s.find_last_of(internal::kSep); +auto tail = (offset == std::string_view::npos) ? 
s : s.substr(offset); +return internal::RemoveTrailingSlash(tail, /*preserve_root=*/false); + } + + /// \brief List the blobs at the root of a container or some dir in a container. + /// + /// \pre container_client is the client for the container named like the first + /// segment of select.base_dir. + Status GetFileInfoWithSelectorFromContainer( + const Azure::Storage::Blobs::BlobContainerClient& container_client, + const Azure::Core::Context& context, Azure::Nullable page_size_hint, + const FileSelector& select, FileInfoVector* acc_results) { +ARROW_ASSIGN_OR_RAISE(auto base_location, AzureLocation::FromString(select.base_dir)); + +bool found = false; +Azure::Storage::Blobs::ListBlobsOptions options; +if (internal::GetAbstractPathDepth(base_location.path) == 0) { + // If the base_dir is the root of the container, then we want to list all blobs in + // the container and the Prefix should be empty and not even include the trailing + // slash because the container itself represents the `/` directory. + options.Prefix = {}; + found = true; // Unless the container itself is not found later! +} else { + options.Prefix = internal::EnsureTrailingSlash(base_location.path); +} +options.PageSizeHint = page_size_hint; +options.Include = Azure::Storage::Blobs::Models::ListBlobsIncludeFlags::Metadata; + +// When Prefix.Value() contains a trailing slash and we find a blob that +// matches it completely, it is an empty directory marker blob for the +// directory we're listing from, and we should skip it. 
+auto is_empty_dir_marker = +[&options](const Azure::Storage::Blobs::Models::BlobItem& blob) noexcept -> bool { + return options.Prefix.HasValue() && blob.Name == options.Prefix.Value(); +}; + +auto recurse = [&](const std::string& blob_prefix) noexcept -> Status { + if (select.recursive && select.max_recursion > 0) { +FileSelector sub_select; +sub_select.base_dir = base_location.container; +sub_select.base_dir += internal::kSep; +sub_select.base_dir += internal::RemoveTrailingSlash(blob_prefix); +sub_select.allow_not_found = true; +sub_select.recursive = true; +sub_select.max_recursion = select.max_recursion - 1; +return GetFileInfoWithSelectorFromContainer( +container_client, context, page_size_hint, sub_select, acc_results); + } + return Status::OK(); +}; + +// (*acc_results)[*last_dir_reported] is the last FileType::Directory in the results +// produced through this loop over the response pages. +std::optional last_dir_reported{}; +auto matches_last_dir_reported = [&last_dir_reported, + acc_re
Re: [PR] GH-38309: [C++] build filesystems as separate modules [arrow]
kou commented on code in PR #39067: URL: https://github.com/apache/arrow/pull/39067#discussion_r1414747502 ## cpp/src/arrow/filesystem/filesystem.cc: ## @@ -738,6 +761,109 @@ Result> FileSystemFromUriReal(const Uri& uri, } // namespace +Status RegisterFileSystemFactory(std::vector schemes, + FileSystem::Factory factory) { + auto& [mutex, scheme_to_factory] = FileSystemFactoryRegistry(); + std::unique_lock lock{mutex}; + + for (auto&& scheme : schemes) { +if (auto& ref = scheme_to_factory[scheme]) { + return Status::KeyError( + "Tried to add a factory for ", scheme, + ":// uris, but a factory was already registered for that scheme"); +} else { + ref = factory; +} + } + return Status::OK(); +} + +// XXX alternative approach: +// If we distrust global constructors more than we distrust +// symbol visibility, we could export specially named functions +// and retrieve them via dlsym. +Status RegisterFileSystemFactoryModule(std::string module_name) { + static auto* handle = dlopen(nullptr, RTLD_NOW | RTLD_LOCAL); Review Comment: Is this changed to `dlopen(std::string("libarrow_fs_") + module_name + ".so", ...)` or something eventually, right? ## cpp/src/arrow/filesystem/filesystem.cc: ## @@ -677,12 +684,28 @@ Status CopyFiles(const std::shared_ptr& source_fs, namespace { +auto& FileSystemFactoryRegistry() { + static struct { +std::recursive_mutex mutex; +std::unordered_map scheme_to_factory; + } registry; + return registry; +} + Result> FileSystemFromUriReal(const Uri& uri, const std::string& uri_string, const io::IOContext& io_context, std::string* out_path) { const auto scheme = uri.scheme(); + { +auto& [mutex, scheme_to_factory] = FileSystemFactoryRegistry(); +std::unique_lock lock{mutex}; Review Comment: Can we use `shared_lock` here? It seems that this doesn't change the registry. 
## cpp/src/arrow/filesystem/filesystem.cc: ##

@@ -738,6 +761,109 @@ Result<std::shared_ptr<FileSystem>> FileSystemFromUriReal(const Uri& uri,

 } // namespace

+Status RegisterFileSystemFactory(std::vector<std::string> schemes,
+                                 FileSystem::Factory factory) {
+  auto& [mutex, scheme_to_factory] = FileSystemFactoryRegistry();
+  std::unique_lock lock{mutex};
+
+  for (auto&& scheme : schemes) {
+    if (auto& ref = scheme_to_factory[scheme]) {
+      return Status::KeyError(
+          "Tried to add a factory for ", scheme,
+          ":// uris, but a factory was already registered for that scheme");
+    } else {
+      ref = factory;
+    }
+  }
+  return Status::OK();
+}
+
+// XXX alternative approach:

Review Comment:
   I like this approach, because hiding the `dlopen()`/`LoadLibrary()` details from users makes it easy to use. (Users would need to call `dlopen()`/`LoadLibrary()` explicitly with the `Registrar` approach, right?)

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
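The registration logic being reviewed above — a process-wide map from URI scheme to filesystem factory that refuses to overwrite an existing entry — can be sketched in Python. This is a language-neutral toy model of the idea, not Arrow's actual C++ API; the class and method names here are invented for illustration:

```python
import threading

class FileSystemFactoryRegistry:
    """Toy scheme -> factory registry mirroring the C++ loop quoted above."""

    def __init__(self):
        self._lock = threading.Lock()
        self._scheme_to_factory = {}

    def register(self, schemes, factory):
        # Like RegisterFileSystemFactory: refuse duplicate schemes. The error
        # is raised mid-loop, just as the C++ code returns Status::KeyError
        # from inside its loop.
        with self._lock:
            for scheme in schemes:
                if self._scheme_to_factory.get(scheme) is not None:
                    raise KeyError(
                        f"Tried to add a factory for {scheme}:// uris, "
                        f"but a factory was already registered for that scheme")
                self._scheme_to_factory[scheme] = factory

    def lookup(self, scheme):
        # Read-only path -- this is where kou suggests a shared (reader)
        # lock would suffice in the C++ version.
        with self._lock:
            return self._scheme_to_factory.get(scheme)
```

Note the design point under discussion: registration is the only writer, so lookups (the hot path in `FileSystemFromUriReal`) could take a shared lock instead of an exclusive one.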
Re: [PR] GH-38597: [C++] Implement GetFileInfo(selector) for Azure filesystem [arrow]
felipecrv commented on code in PR #39009: URL: https://github.com/apache/arrow/pull/39009#discussion_r1414744853

## cpp/src/arrow/filesystem/azurefs.cc: ##

@@ -815,6 +815,233 @@ class AzureFileSystem::Impl {
     }
   }

+ private:
+  template <typename OnContainer>
+  Status ListContainers(const Azure::Core::Context& context,
+                        OnContainer&& on_container) const {
+    Azure::Storage::Blobs::ListBlobContainersOptions options;
+    // Deleted containers are not returned.
+    options.Include = Azure::Storage::Blobs::Models::ListBlobContainersIncludeFlags::None;
+    try {
+      auto container_list_response =
+          blob_service_client_->ListBlobContainers(options, context);
+      for (; container_list_response.HasPage();
+           container_list_response.MoveToNextPage(context)) {
+        for (const auto& container : container_list_response.BlobContainers) {
+          RETURN_NOT_OK(on_container(container));
+        }
+      }
+    } catch (const Azure::Storage::StorageException& exception) {
+      return internal::ExceptionToStatus("Failed to list account containers.", exception);
+    }
+    return Status::OK();
+  }
+
+  static FileInfo FileInfoFromBlob(const std::string& container,
+                                   const Azure::Storage::Blobs::Models::BlobItem& blob) {
+    if (blob.Name.back() == internal::kSep) {
+      return DirectoryFileInfoFromPath(container + internal::kSep + blob.Name);
+    }
+    std::string path;
+    path.reserve(container.size() + 1 + blob.Name.size());
+    path += container;
+    path += internal::kSep;
+    path += blob.Name;
+    FileInfo info{std::move(path), FileType::File};
+    info.set_size(blob.BlobSize);
+    info.set_mtime(std::chrono::system_clock::time_point{blob.Details.LastModified});
+    return info;
+  }
+
+  static FileInfo DirectoryFileInfoFromPath(const std::string& path) {
+    return FileInfo{std::string{internal::RemoveTrailingSlash(path)},
+                    FileType::Directory};
+  }
+
+  static std::string_view BasenameView(std::string_view s) {
+    auto offset = s.find_last_of(internal::kSep);
+    auto tail = (offset == std::string_view::npos) ? s : s.substr(offset);
+    return internal::RemoveTrailingSlash(tail, /*preserve_root=*/false);
+  }
+
+  /// \brief List the blobs at the root of a container or some dir in a container.
+  ///
+  /// \pre container_client is the client for the container named like the first
+  /// segment of select.base_dir.
+  Status GetFileInfoWithSelectorFromContainer(
+      const Azure::Storage::Blobs::BlobContainerClient& container_client,
+      const Azure::Core::Context& context, Azure::Nullable<int32_t> page_size_hint,
+      const FileSelector& select, FileInfoVector* acc_results) {
+    ARROW_ASSIGN_OR_RAISE(auto base_location, AzureLocation::FromString(select.base_dir));
+
+    bool found = false;
+    Azure::Storage::Blobs::ListBlobsOptions options;
+    if (internal::GetAbstractPathDepth(base_location.path) == 0) {
+      // If the base_dir is the root of the container, then we want to list all blobs in
+      // the container and the Prefix should be empty and not even include the trailing
+      // slash because the container itself represents the `/` directory.
+      options.Prefix = {};
+      found = true;  // Unless the container itself is not found later!
+    } else {
+      options.Prefix = internal::EnsureTrailingSlash(base_location.path);
+    }
+    options.PageSizeHint = page_size_hint;
+    options.Include = Azure::Storage::Blobs::Models::ListBlobsIncludeFlags::Metadata;
+
+    // When Prefix.Value() contains a trailing slash and we find a blob that
+    // matches it completely, it is an empty directory marker blob for the
+    // directory we're listing from, and we should skip it.
+    auto is_empty_dir_marker =
+        [&options](const Azure::Storage::Blobs::Models::BlobItem& blob) noexcept -> bool {
+      return options.Prefix.HasValue() && blob.Name == options.Prefix.Value();
+    };
+
+    auto recurse = [&](const std::string& blob_prefix) noexcept -> Status {
+      if (select.recursive && select.max_recursion > 0) {
+        FileSelector sub_select;
+        sub_select.base_dir = base_location.container;
+        sub_select.base_dir += internal::kSep;
+        sub_select.base_dir += internal::RemoveTrailingSlash(blob_prefix);
+        sub_select.allow_not_found = true;
+        sub_select.recursive = true;
+        sub_select.max_recursion = select.max_recursion - 1;
+        return GetFileInfoWithSelectorFromContainer(
+            container_client, context, page_size_hint, sub_select, acc_results);
+      }
+      return Status::OK();
+    };
+
+    // (*acc_results)[*last_dir_reported] is the last FileType::Directory in the results
+    // produced through this loop over the response pages.
+    std::optional<size_t> last_dir_reported{};

Review Comment:
   This is needed because the Azure server might break the page after a "dir/", before
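The listing logic quoted above — skip the empty-directory marker blob, report each sub-directory once, and recurse with a decremented `max_recursion` — can be modeled in Python. This is a toy sketch of the idea, not the Azure SDK API: a flat `set` of blob names stands in for the blob listing, with a trailing `/` marking an empty-directory marker blob:

```python
def list_prefix(blobs, base_dir, recursive=True, max_recursion=2**31 - 1):
    """Toy model of GetFileInfoWithSelectorFromContainer over a flat namespace.

    `blobs` is a set of blob names such as {"dir/", "dir/a", "dir/sub/b"};
    a name ending in "/" is an empty-directory marker blob.
    Returns (path, kind) pairs, kind being "file" or "dir".
    """
    prefix = base_dir.rstrip("/") + "/" if base_dir else ""
    results = []
    seen_dirs = set()
    for name in sorted(blobs):
        if not name.startswith(prefix):
            continue
        if name == prefix:
            continue  # empty-directory marker for the dir being listed: skip it
        rest = name[len(prefix):]
        if "/" in rest:
            # Blob lives under a sub-directory: report the dir once, then
            # recurse with one fewer level of recursion allowed.
            sub = prefix + rest.split("/", 1)[0]
            if sub not in seen_dirs:
                seen_dirs.add(sub)
                results.append((sub, "dir"))
                if recursive and max_recursion > 0:
                    results += list_prefix(blobs, sub, recursive, max_recursion - 1)
        elif not name.endswith("/"):
            results.append((name, "file"))
    return results
```

With `recursive=False` or `max_recursion=0`, sub-directories are still reported but their contents are not, mirroring the `recurse` lambda's guard.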
Re: [PR] GH-38597: [C++] Implement GetFileInfo(selector) for Azure filesystem [arrow]
felipecrv commented on PR #39009: URL: https://github.com/apache/arrow/pull/39009#issuecomment-1839860421 I rebased to fix a conflict, but the new changes are on the last two commits. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] DataFusion weekly project plan (Andrew Lamb) - Nov 27, 2023 [arrow-datafusion]
alamb closed issue #8329: DataFusion weekly project plan (Andrew Lamb) - Nov 27, 2023 URL: https://github.com/apache/arrow-datafusion/issues/8329 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] DataFusion weekly project plan (Andrew Lamb) - Nov 27, 2023 [arrow-datafusion]
alamb commented on issue #8329: URL: https://github.com/apache/arrow-datafusion/issues/8329#issuecomment-1839853479 Next Week: https://github.com/apache/arrow-datafusion/issues/8420 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] DataFusion weekly project plan (Andrew Lamb) - Nov 27, 2023 [arrow-datafusion]
alamb commented on issue #8329: URL: https://github.com/apache/arrow-datafusion/issues/8329#issuecomment-1839853226

> Thanks @alamb !
>
> Many people have asked me how to get involved in the DataFusion community and what they can do for the project (because I've previously promoted DataFusion on Chinese social media). This summary is very helpful for them.

Thank you for this feedback. It is very helpful. Other than writing up more detailed tickets, do you know of other things that would help?

> I have pinned this issue to make it easy for everyone to view.

I saw this -- that was very cool. Thank you. I will pin future such issues.

> Once again, thank you for all that you have done for the community.

Thank you very much @jackwener - I am always amazed by all the work we do together.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[I] DataFusion weekly project plan (Andrew Lamb) - Dec 4, 2023 [arrow-datafusion]
alamb opened a new issue, #8420: URL: https://github.com/apache/arrow-datafusion/issues/8420

Follow on to [Nov 27 2023](https://github.com/apache/arrow-datafusion/issues/8329):

## Boilerplate Overview

The idea of this ticket is to make my plans for DataFusion visible, largely for my own personal organizational needs, but also to:

1. Try some different ways to communicate / coordinate in the community
2. Help provide an interesting summary of what is happening in DataFusion this week

It would be great if anyone else who has plans like this for DataFusion could try to make them visible somehow as well.

## My (personal) plans for this week

- [ ] Complete https://github.com/apache/arrow-datafusion/issues/8376
- [ ] Reproduce and file several bugs / limitations we have hit in Influx recently

## Project Queue (list of future projects)

- [ ] https://github.com/apache/arrow-datafusion/issues/8227 (specifically, https://github.com/apache/arrow-datafusion/issues/8229 as a step towards https://github.com/apache/arrow-datafusion/issues/8078)
- [ ] https://github.com/apache/arrow-datafusion/issues/8130, in support of addressing memory explosion in HashJoin outputs: https://github.com/apache/arrow-datafusion/issues/7848 with @korowa

## Additional major projects I hope to help review and coordinate

- [ ] Next steps towards making function packages more modular (make all BuiltInScalarFunctions ScalarUDFs): https://github.com/apache/arrow-datafusion/issues/8045 with @2010YOUY01 .
This week I hope to get https://github.com/apache/arrow-datafusion/pull/8046 mergeable, which would require https://github.com/apache/arrow-datafusion/issues/8157

- [ ] `ARRAY` function hardening: https://github.com/apache/arrow-datafusion/issues/6980 / https://github.com/apache/arrow-datafusion/issues/7988
- [ ] Review any new ticket / bug reports

# Algorithm for prioritizing (my) PR reviews

Note there are many [committers](https://arrow.apache.org/committers/) who can review and merge PRs as well, so these are not the priorities of the project as a whole, just the approximate algorithm I am using.

Priority:

1. Bug fixes (where something is just incorrect), especially regressions (where it used to work and now does not)
2. Improvements directly related to features needed for InfluxDB (as I am employed by them)
3. Documentation and test improvements (I view these as very strategically important)
4. PRs that I think are strategically important
5. Other new features / additions to functionality

The current strategically important projects in my head are:

* https://github.com/apache/arrow-datafusion/issues/8045
* Improving performance of high cardinality aggregation

Note that adding new array functionality (e.g. https://github.com/apache/arrow-datafusion/issues/6980 and its children) is very low on my priority list, not because I don't think it is valuable, but because I think it is less valuable than the other projects going on - it would be great if other people could help pick these up too.

Thus, if you are interested in contributing to DataFusion and want a fast turnaround time, I would recommend looking into bug fixes / test improvements / documentation / etc. If you propose adding new functionality, the review cycle will likely be longer. You can shorten the cycle by looking at the comments on other recent PRs and following the same model (e.g. ensure there are tests in sqllogictest).

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] GH-15060: [JS] Add LargeUtf8 type [arrow]
domoritz commented on PR #35780: URL: https://github.com/apache/arrow/pull/35780#issuecomment-1839849622 Hmm, then we can't really automatically flush in the vector builder. That's a bummer. Maybe then I need to implement a chunked buffer that can satisfy the `DataBuffer` type. I need to look at what implications that may have. Maybe @trxcllnt has some suggestion here. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
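The "chunked buffer" idea floated above — a builder that appends into a current chunk and seals it once it reaches a size limit, so no single contiguous allocation (such as a >2 GiB JS buffer for LargeUtf8 offsets) is ever required — reduces to a simple pattern. A minimal Python sketch of the idea, not the actual JS `DataBuffer` interface; names are invented for illustration:

```python
class ChunkedBuffer:
    """Values are appended to a current chunk; when the chunk reaches
    `chunk_size` it is sealed and a new one is started, so the total data
    never needs to live in one contiguous allocation."""

    def __init__(self, chunk_size=4):
        self.chunk_size = chunk_size
        self.chunks = []    # sealed chunks
        self.current = []   # chunk being filled

    def append(self, value):
        self.current.append(value)
        if len(self.current) >= self.chunk_size:
            self.flush()

    def flush(self):
        # Seal the current chunk, if non-empty (the "automatic flush"
        # discussed above happens in append()).
        if self.current:
            self.chunks.append(self.current)
            self.current = []

    def to_list(self):
        # Logical contents, in order, across all chunks.
        return [v for chunk in self.chunks for v in chunk] + list(self.current)
```

The open question in the thread is whether such a structure can satisfy the builder's expectations when consumers assume one contiguous buffer per vector.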
Re: [PR] MINOR: [CI] Bump actions/setup-java from 3 to 4 [arrow]
kou merged PR #39063: URL: https://github.com/apache/arrow/pull/39063 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] MINOR: [CI] Bump actions/labeler from 4.3.0 to 5.0.0 [arrow]
kou commented on PR #39062: URL: https://github.com/apache/arrow/pull/39062#issuecomment-1839847476

Hmm. It seems that we need to change our configuration files for this, something like the following:

```diff
diff --git a/.github/workflows/dev_pr/labeler.yml b/.github/workflows/dev_pr/labeler.yml
index c31d1f309..bb4d261f8 100644
--- a/.github/workflows/dev_pr/labeler.yml
+++ b/.github/workflows/dev_pr/labeler.yml
@@ -16,7 +16,9 @@
 # under the License.

 "Component: C++":
-  - cpp/**/*
+  - any:
+    - any-glob-to-any-file:
+      - cpp/**/*

 "Component: GLib":
   - c_glib/**/*
```

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] GH-38597: [C++] Implement GetFileInfo(selector) for Azure filesystem [arrow]
felipecrv commented on code in PR #39009: URL: https://github.com/apache/arrow/pull/39009#discussion_r1414734379

## cpp/src/arrow/filesystem/azurefs.cc: ##

@@ -815,6 +815,233 @@ class AzureFileSystem::Impl {
     }
   }

+ private:
+  template <typename OnContainer>
+  Status ListContainers(const Azure::Core::Context& context,
+                        OnContainer&& on_container) const {
+    Azure::Storage::Blobs::ListBlobContainersOptions options;
+    // Deleted containers are not returned.
+    options.Include = Azure::Storage::Blobs::Models::ListBlobContainersIncludeFlags::None;
+    try {
+      auto container_list_response =
+          blob_service_client_->ListBlobContainers(options, context);
+      for (; container_list_response.HasPage();
+           container_list_response.MoveToNextPage(context)) {
+        for (const auto& container : container_list_response.BlobContainers) {
+          RETURN_NOT_OK(on_container(container));
+        }
+      }
+    } catch (const Azure::Storage::StorageException& exception) {
+      return internal::ExceptionToStatus("Failed to list account containers.", exception);
+    }
+    return Status::OK();
+  }
+
+  static FileInfo FileInfoFromBlob(const std::string& container,
+                                   const Azure::Storage::Blobs::Models::BlobItem& blob) {
+    if (blob.Name.back() == internal::kSep) {
+      return DirectoryFileInfoFromPath(container + internal::kSep + blob.Name);
+    }
+    std::string path;
+    path.reserve(container.size() + 1 + blob.Name.size());
+    path += container;
+    path += internal::kSep;
+    path += blob.Name;
+    FileInfo info{std::move(path), FileType::File};
+    info.set_size(blob.BlobSize);
+    info.set_mtime(std::chrono::system_clock::time_point{blob.Details.LastModified});
+    return info;
+  }
+
+  static FileInfo DirectoryFileInfoFromPath(const std::string& path) {
+    return FileInfo{std::string{internal::RemoveTrailingSlash(path)},
+                    FileType::Directory};
+  }
+
+  static std::string_view BasenameView(std::string_view s) {
+    auto offset = s.find_last_of(internal::kSep);
+    auto tail = (offset == std::string_view::npos) ? s : s.substr(offset);
+    return internal::RemoveTrailingSlash(tail, /*preserve_root=*/false);
+  }
+
+  /// \brief List the blobs at the root of a container or some dir in a container.
+  ///
+  /// \pre container_client is the client for the container named like the first
+  /// segment of select.base_dir.
+  Status GetFileInfoWithSelectorFromContainer(
+      const Azure::Storage::Blobs::BlobContainerClient& container_client,
+      const Azure::Core::Context& context, Azure::Nullable<int32_t> page_size_hint,
+      const FileSelector& select, FileInfoVector* acc_results) {
+    ARROW_ASSIGN_OR_RAISE(auto base_location, AzureLocation::FromString(select.base_dir));
+
+    bool found = false;
+    Azure::Storage::Blobs::ListBlobsOptions options;
+    if (internal::GetAbstractPathDepth(base_location.path) == 0) {
+      // If the base_dir is the root of the container, then we want to list all blobs in
+      // the container and the Prefix should be empty and not even include the trailing
+      // slash because the container itself represents the `/` directory.
+      options.Prefix = {};
+      found = true;  // Unless the container itself is not found later!
+    } else {
+      options.Prefix = internal::EnsureTrailingSlash(base_location.path);
+    }
+    options.PageSizeHint = page_size_hint;
+    options.Include = Azure::Storage::Blobs::Models::ListBlobsIncludeFlags::Metadata;

Review Comment:
   This is something I'd prefer to do once all the PRs land, because it would be a noisy change.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] GH-38597: [C++] Implement GetFileInfo(selector) for Azure filesystem [arrow]
felipecrv commented on code in PR #39009: URL: https://github.com/apache/arrow/pull/39009#discussion_r1414734002

## cpp/src/arrow/filesystem/azurefs.cc: ##

@@ -815,6 +815,233 @@ class AzureFileSystem::Impl {
     }
   }

+ private:
+  template <typename OnContainer>
+  Status ListContainers(const Azure::Core::Context& context,
+                        OnContainer&& on_container) const {
+    Azure::Storage::Blobs::ListBlobContainersOptions options;
+    // Deleted containers are not returned.
+    options.Include = Azure::Storage::Blobs::Models::ListBlobContainersIncludeFlags::None;
+    try {
+      auto container_list_response =
+          blob_service_client_->ListBlobContainers(options, context);
+      for (; container_list_response.HasPage();
+           container_list_response.MoveToNextPage(context)) {
+        for (const auto& container : container_list_response.BlobContainers) {
+          RETURN_NOT_OK(on_container(container));
+        }
+      }
+    } catch (const Azure::Storage::StorageException& exception) {
+      return internal::ExceptionToStatus("Failed to list account containers.", exception);
+    }
+    return Status::OK();
+  }
+
+  static FileInfo FileInfoFromBlob(const std::string& container,
+                                   const Azure::Storage::Blobs::Models::BlobItem& blob) {
+    if (blob.Name.back() == internal::kSep) {
+      return DirectoryFileInfoFromPath(container + internal::kSep + blob.Name);

Review Comment:
   Code evolves in mysterious ways when it's late into the workday 🤡 It's gonna be `ConcatAbstractPath` everywhere now.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] Minor: Improve `PruningPredicate` documentation [arrow-datafusion]
alamb merged PR #8394: URL: https://github.com/apache/arrow-datafusion/pull/8394 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] Minor: Add example with parameters to LogicalPlan [arrow-datafusion]
alamb merged PR #8418: URL: https://github.com/apache/arrow-datafusion/pull/8418 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] GH-15060: [JS] Add LargeUtf8 type [arrow]
kylebarron commented on PR #35780: URL: https://github.com/apache/arrow/pull/35780#issuecomment-1839842156 But then you call ```ts table.batches[0].numRows // 1 ``` Implicitly, a batch has a constant number of rows across all `Data` instances. When I test it in Python, surprisingly the `Table` constructor _allows_ it, but then when you call `to_batches` it reaggregates chunks into consistent record batches: ```py import pyarrow as pa d1 = [ pa.array([1, 2], pa.int32()), pa.array([3], pa.int32()), ] d2 = [ pa.array([4], pa.int32()), pa.array([5, 6], pa.int32()), ] c1 = pa.chunked_array(d1) c2 = pa.chunked_array(d2) table = pa.table({"v1": c1, "v2": c2}) table # pyarrow.Table # v1: int32 # v2: int32 # # v1: [[1,2],[3]] # v2: [[4],[5,6]] # to_batches regroups into batches of length 1 table.to_batches() # [pyarrow.RecordBatch # v1: int32 # v2: int32 # # v1: [1] # v2: [4], # pyarrow.RecordBatch # v1: int32 # v2: int32 # # v1: [2] # v2: [5], # pyarrow.RecordBatch # v1: int32 # v2: int32 # # v1: [3] # v2: [6]] ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
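The regrouping `to_batches` performs in the pyarrow example above can be reproduced by hand: take the union of every column's chunk boundaries and slice all columns at those offsets, so each resulting batch has a consistent row count. A small Python sketch of that idea (plain lists stand in for Arrow chunks; this is not pyarrow's implementation):

```python
def common_boundaries(chunked_columns):
    """Union of cumulative chunk offsets across all columns."""
    boundaries = set()
    for chunks in chunked_columns:
        offset = 0
        for chunk in chunks:
            offset += len(chunk)
            boundaries.add(offset)
    return sorted(boundaries)

def rebatch(chunked_columns):
    """Regroup columns with mismatched chunking into aligned batches,
    the way Table.to_batches does in the example above."""
    cuts = [0] + common_boundaries(chunked_columns)
    # Flatten each column, then slice all columns at the shared cut points.
    flat = [[v for chunk in chunks for v in chunk] for chunks in chunked_columns]
    return [
        tuple(col[cuts[i]:cuts[i + 1]] for col in flat)
        for i in range(len(cuts) - 1)
    ]
```

Applied to the chunkings in the example (`[[1, 2], [3]]` and `[[4], [5, 6]]`), the shared boundaries are 1, 2, and 3, which is why pyarrow produces three one-row batches.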
Re: [PR] GH-38597: [C++] Implement GetFileInfo(selector) for Azure filesystem [arrow]
felipecrv commented on code in PR #39009: URL: https://github.com/apache/arrow/pull/39009#discussion_r1414731605

## cpp/src/arrow/filesystem/azurefs.cc: ##

@@ -815,6 +815,233 @@ class AzureFileSystem::Impl {
     }
   }

+ private:
+  template <typename OnContainer>
+  Status ListContainers(const Azure::Core::Context& context,
+                        OnContainer&& on_container) const {
+    Azure::Storage::Blobs::ListBlobContainersOptions options;
+    // Deleted containers are not returned.
+    options.Include = Azure::Storage::Blobs::Models::ListBlobContainersIncludeFlags::None;
+    try {
+      auto container_list_response =
+          blob_service_client_->ListBlobContainers(options, context);
+      for (; container_list_response.HasPage();
+           container_list_response.MoveToNextPage(context)) {
+        for (const auto& container : container_list_response.BlobContainers) {
+          RETURN_NOT_OK(on_container(container));
+        }
+      }
+    } catch (const Azure::Storage::StorageException& exception) {
+      return internal::ExceptionToStatus("Failed to list account containers.", exception);
+    }
+    return Status::OK();
+  }
+
+  static FileInfo FileInfoFromBlob(const std::string& container,
+                                   const Azure::Storage::Blobs::Models::BlobItem& blob) {
+    if (blob.Name.back() == internal::kSep) {
+      return DirectoryFileInfoFromPath(container + internal::kSep + blob.Name);
+    }
+    std::string path;
+    path.reserve(container.size() + 1 + blob.Name.size());
+    path += container;
+    path += internal::kSep;
+    path += blob.Name;
+    FileInfo info{std::move(path), FileType::File};
+    info.set_size(blob.BlobSize);
+    info.set_mtime(std::chrono::system_clock::time_point{blob.Details.LastModified});
+    return info;
+  }
+
+  static FileInfo DirectoryFileInfoFromPath(const std::string& path) {
+    return FileInfo{std::string{internal::RemoveTrailingSlash(path)},
+                    FileType::Directory};
+  }
+
+  static std::string_view BasenameView(std::string_view s) {
+    auto offset = s.find_last_of(internal::kSep);
+    auto tail = (offset == std::string_view::npos) ? s : s.substr(offset);
+    return internal::RemoveTrailingSlash(tail, /*preserve_root=*/false);
+  }
+
+  /// \brief List the blobs at the root of a container or some dir in a container.
+  ///
+  /// \pre container_client is the client for the container named like the first
+  /// segment of select.base_dir.
+  Status GetFileInfoWithSelectorFromContainer(
+      const Azure::Storage::Blobs::BlobContainerClient& container_client,
+      const Azure::Core::Context& context, Azure::Nullable<int32_t> page_size_hint,
+      const FileSelector& select, FileInfoVector* acc_results) {
+    ARROW_ASSIGN_OR_RAISE(auto base_location, AzureLocation::FromString(select.base_dir));
+
+    bool found = false;
+    Azure::Storage::Blobs::ListBlobsOptions options;
+    if (internal::GetAbstractPathDepth(base_location.path) == 0) {
+      // If the base_dir is the root of the container, then we want to list all blobs in
+      // the container and the Prefix should be empty and not even include the trailing
+      // slash because the container itself represents the `/` directory.
+      options.Prefix = {};
+      found = true;  // Unless the container itself is not found later!
+    } else {
+      options.Prefix = internal::EnsureTrailingSlash(base_location.path);
+    }
+    options.PageSizeHint = page_size_hint;
+    options.Include = Azure::Storage::Blobs::Models::ListBlobsIncludeFlags::Metadata;
+
+    // When Prefix.Value() contains a trailing slash and we find a blob that
+    // matches it completely, it is an empty directory marker blob for the
+    // directory we're listing from, and we should skip it.
+    auto is_empty_dir_marker =
+        [&options](const Azure::Storage::Blobs::Models::BlobItem& blob) noexcept -> bool {
+      return options.Prefix.HasValue() && blob.Name == options.Prefix.Value();
+    };
+
+    auto recurse = [&](const std::string& blob_prefix) noexcept -> Status {
+      if (select.recursive && select.max_recursion > 0) {
+        FileSelector sub_select;
+        sub_select.base_dir = base_location.container;
+        sub_select.base_dir += internal::kSep;
+        sub_select.base_dir += internal::RemoveTrailingSlash(blob_prefix);
+        sub_select.allow_not_found = true;
+        sub_select.recursive = true;
+        sub_select.max_recursion = select.max_recursion - 1;
+        return GetFileInfoWithSelectorFromContainer(
+            container_client, context, page_size_hint, sub_select, acc_results);
+      }
+      return Status::OK();
+    };
+
+    // (*acc_results)[*last_dir_reported] is the last FileType::Directory in the results
+    // produced through this loop over the response pages.
+    std::optional<size_t> last_dir_reported{};
+    auto matches_last_dir_reported = [&last_dir_reported,
+                                      acc_re
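The `last_dir_reported` bookkeeping quoted above guards against reporting the same directory twice when the server breaks a response page in the middle of a directory's blobs. The dedup idea reduces to tracking the index of the last `Directory` entry appended; a Python sketch with hypothetical names (not Arrow's actual helper):

```python
def append_dir_once(results, last_dir_reported, dir_path):
    """Append a directory entry unless it repeats the last one reported.

    `last_dir_reported` is the index of the last directory entry in
    `results`, or None, mirroring the std::optional in the C++ code above.
    Returns the (possibly unchanged) index of the last directory entry.
    """
    if last_dir_reported is not None and results[last_dir_reported] == dir_path:
        # Same directory seen again, e.g. because a page break split its blobs
        # across two response pages: do not report it a second time.
        return last_dir_reported
    results.append(dir_path)
    return len(results) - 1
```

Tracking an index into the accumulated results (rather than a copy of the path) means the check stays cheap even across many pages.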
Re: [PR] GH-38597: [C++] Implement GetFileInfo(selector) for Azure filesystem [arrow]
felipecrv commented on code in PR #39009: URL: https://github.com/apache/arrow/pull/39009#discussion_r1414730259

## cpp/src/arrow/filesystem/azurefs.cc: ##

@@ -815,6 +815,233 @@ class AzureFileSystem::Impl {
     }
   }

+ private:
+  template <typename OnContainer>
+  Status ListContainers(const Azure::Core::Context& context,
+                        OnContainer&& on_container) const {
+    Azure::Storage::Blobs::ListBlobContainersOptions options;
+    // Deleted containers are not returned.
+    options.Include = Azure::Storage::Blobs::Models::ListBlobContainersIncludeFlags::None;
+    try {
+      auto container_list_response =
+          blob_service_client_->ListBlobContainers(options, context);
+      for (; container_list_response.HasPage();
+           container_list_response.MoveToNextPage(context)) {
+        for (const auto& container : container_list_response.BlobContainers) {
+          RETURN_NOT_OK(on_container(container));
+        }
+      }
+    } catch (const Azure::Storage::StorageException& exception) {
+      return internal::ExceptionToStatus("Failed to list account containers.", exception);
+    }
+    return Status::OK();
+  }
+
+  static FileInfo FileInfoFromBlob(const std::string& container,
+                                   const Azure::Storage::Blobs::Models::BlobItem& blob) {
+    if (blob.Name.back() == internal::kSep) {
+      return DirectoryFileInfoFromPath(container + internal::kSep + blob.Name);
+    }
+    std::string path;
+    path.reserve(container.size() + 1 + blob.Name.size());
+    path += container;
+    path += internal::kSep;
+    path += blob.Name;
+    FileInfo info{std::move(path), FileType::File};
+    info.set_size(blob.BlobSize);
+    info.set_mtime(std::chrono::system_clock::time_point{blob.Details.LastModified});
+    return info;
+  }
+
+  static FileInfo DirectoryFileInfoFromPath(const std::string& path) {
+    return FileInfo{std::string{internal::RemoveTrailingSlash(path)},
+                    FileType::Directory};
+  }
+
+  static std::string_view BasenameView(std::string_view s) {
+    auto offset = s.find_last_of(internal::kSep);
+    auto tail = (offset == std::string_view::npos) ? s : s.substr(offset);
+    return internal::RemoveTrailingSlash(tail, /*preserve_root=*/false);
+  }
+
+  /// \brief List the blobs at the root of a container or some dir in a container.
+  ///
+  /// \pre container_client is the client for the container named like the first
+  /// segment of select.base_dir.
+  Status GetFileInfoWithSelectorFromContainer(
+      const Azure::Storage::Blobs::BlobContainerClient& container_client,
+      const Azure::Core::Context& context, Azure::Nullable<int32_t> page_size_hint,
+      const FileSelector& select, FileInfoVector* acc_results) {
+    ARROW_ASSIGN_OR_RAISE(auto base_location, AzureLocation::FromString(select.base_dir));
+
+    bool found = false;
+    Azure::Storage::Blobs::ListBlobsOptions options;
+    if (internal::GetAbstractPathDepth(base_location.path) == 0) {
+      // If the base_dir is the root of the container, then we want to list all blobs in
+      // the container and the Prefix should be empty and not even include the trailing
+      // slash because the container itself represents the `/` directory.
+      options.Prefix = {};
+      found = true;  // Unless the container itself is not found later!
+    } else {
+      options.Prefix = internal::EnsureTrailingSlash(base_location.path);
+    }
+    options.PageSizeHint = page_size_hint;
+    options.Include = Azure::Storage::Blobs::Models::ListBlobsIncludeFlags::Metadata;
+
+    // When Prefix.Value() contains a trailing slash and we find a blob that
+    // matches it completely, it is an empty directory marker blob for the
+    // directory we're listing from, and we should skip it.
+    auto is_empty_dir_marker =
+        [&options](const Azure::Storage::Blobs::Models::BlobItem& blob) noexcept -> bool {
+      return options.Prefix.HasValue() && blob.Name == options.Prefix.Value();
+    };
+
+    auto recurse = [&](const std::string& blob_prefix) noexcept -> Status {
+      if (select.recursive && select.max_recursion > 0) {
+        FileSelector sub_select;
+        sub_select.base_dir = base_location.container;
+        sub_select.base_dir += internal::kSep;
+        sub_select.base_dir += internal::RemoveTrailingSlash(blob_prefix);
+        sub_select.allow_not_found = true;
+        sub_select.recursive = true;
+        sub_select.max_recursion = select.max_recursion - 1;
+        return GetFileInfoWithSelectorFromContainer(
+            container_client, context, page_size_hint, sub_select, acc_results);
+      }
+      return Status::OK();
+    };
+
+    // (*acc_results)[*last_dir_reported] is the last FileType::Directory in the results
+    // produced through this loop over the response pages.
+    std::optional<size_t> last_dir_reported{};
+    auto matches_last_dir_reported = [&last_dir_reported,
+                                      acc_re
Re: [PR] GH-38597: [C++] Implement GetFileInfo(selector) for Azure filesystem [arrow]
felipecrv commented on code in PR #39009: URL: https://github.com/apache/arrow/pull/39009#discussion_r1414723747

## cpp/src/arrow/filesystem/azurefs.cc:

@@ -815,6 +815,233 @@ class AzureFileSystem::Impl {
     }
   }

+ private:
+  template <typename OnContainer>
+  Status ListContainers(const Azure::Core::Context& context,

Review Comment: I will change it to `VisitContainers`. `ListContainers` is indeed a bad name.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
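The rename to `VisitContainers` reflects the callback-visitor shape of the method: it walks paginated listing results and hands each item to a caller-supplied functor, propagating the first non-OK status. A minimal sketch of that control flow, using `int` as a stand-in for Arrow's `Status` and nested vectors as a stand-in for the paginated SDK response (all names here are illustrative, not the actual Azure SDK or Arrow API):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-in for Status: 0 == OK, any other value == error.
using MiniStatus = int;

// Visit every item across all "pages", forwarding each to on_item and
// propagating the first failure -- the shape behind a Visit* method.
template <typename OnItem>
MiniStatus VisitItems(const std::vector<std::vector<std::string>>& pages,
                      OnItem&& on_item) {
  for (const auto& page : pages) {
    for (const auto& item : page) {
      MiniStatus st = on_item(item);
      if (st != 0) return st;  // analogous to RETURN_NOT_OK
    }
  }
  return 0;
}
```

A "Visit" name signals this shape better than "List", which suggests returning a collection.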
Re: [PR] GH-38597: [C++] Implement GetFileInfo(selector) for Azure filesystem [arrow]
felipecrv commented on code in PR #39009: URL: https://github.com/apache/arrow/pull/39009#discussion_r1414722781

## cpp/src/arrow/filesystem/azurefs.cc:

@@ -1103,7 +1330,12 @@ Result<FileInfo> AzureFileSystem::GetFileInfo(const std::string& path) {
 }

 Result<FileInfoVector> AzureFileSystem::GetFileInfo(const FileSelector& select) {
-  return Status::NotImplemented("The Azure FileSystem is not fully implemented");
+  Azure::Core::Context context;
+  Azure::Nullable<int32_t> page_size_hint;  // unspecified
+  FileInfoVector results;
+  RETURN_NOT_OK(
+      impl_->GetFileInfoWithSelector(context, page_size_hint, select, &results));

Review Comment: Since `Impl` doesn't necessarily have to provide overloaded methods like the public interface does, I opted for a clearer name.
Re: [PR] GH-38597: [C++] Implement GetFileInfo(selector) for Azure filesystem [arrow]
felipecrv commented on code in PR #39009: URL: https://github.com/apache/arrow/pull/39009#discussion_r1414721290

## cpp/src/arrow/filesystem/azurefs.cc:

@@ -815,6 +815,233 @@ class AzureFileSystem::Impl {
     }
   }

+ private:
+  template <typename OnContainer>
+  Status ListContainers(const Azure::Core::Context& context,
+                        OnContainer&& on_container) const {
+    Azure::Storage::Blobs::ListBlobContainersOptions options;
+    // Deleted containers are not returned.
+    options.Include = Azure::Storage::Blobs::Models::ListBlobContainersIncludeFlags::None;
+    try {
+      auto container_list_response =
+          blob_service_client_->ListBlobContainers(options, context);
+      for (; container_list_response.HasPage();
+           container_list_response.MoveToNextPage(context)) {
+        for (const auto& container : container_list_response.BlobContainers) {
+          RETURN_NOT_OK(on_container(container));
+        }
+      }
+    } catch (const Azure::Storage::StorageException& exception) {
+      return internal::ExceptionToStatus("Failed to list account containers.", exception);
+    }
+    return Status::OK();
+  }
+
+  static FileInfo FileInfoFromBlob(const std::string& container,
+                                   const Azure::Storage::Blobs::Models::BlobItem& blob) {
+    if (blob.Name.back() == internal::kSep) {
+      return DirectoryFileInfoFromPath(container + internal::kSep + blob.Name);
+    }
+    std::string path;
+    path.reserve(container.size() + 1 + blob.Name.size());
+    path += container;
+    path += internal::kSep;
+    path += blob.Name;
+    FileInfo info{std::move(path), FileType::File};
+    info.set_size(blob.BlobSize);
+    info.set_mtime(std::chrono::system_clock::time_point{blob.Details.LastModified});
+    return info;
+  }
+
+  static FileInfo DirectoryFileInfoFromPath(const std::string& path) {
+    return FileInfo{std::string{internal::RemoveTrailingSlash(path)},
+                    FileType::Directory};
+  }
+
+  static std::string_view BasenameView(std::string_view s) {
+    auto offset = s.find_last_of(internal::kSep);
+    auto tail = (offset == std::string_view::npos) ? s : s.substr(offset);
+    return internal::RemoveTrailingSlash(tail, /*preserve_root=*/false);
+  }
+
+  /// \brief List the blobs at the root of a container or some dir in a container.
+  ///
+  /// \pre container_client is the client for the container named like the first
+  /// segment of select.base_dir.
+  Status GetFileInfoWithSelectorFromContainer(
+      const Azure::Storage::Blobs::BlobContainerClient& container_client,
+      const Azure::Core::Context& context, Azure::Nullable<int32_t> page_size_hint,
+      const FileSelector& select, FileInfoVector* acc_results) {
+    ARROW_ASSIGN_OR_RAISE(auto base_location, AzureLocation::FromString(select.base_dir));
+
+    bool found = false;
+    Azure::Storage::Blobs::ListBlobsOptions options;
+    if (internal::GetAbstractPathDepth(base_location.path) == 0) {
+      // If the base_dir is the root of the container, then we want to list all blobs in
+      // the container and the Prefix should be empty and not even include the trailing
+      // slash because the container itself represents the `/` directory.
+      options.Prefix = {};
+      found = true;  // Unless the container itself is not found later!
+    } else {
+      options.Prefix = internal::EnsureTrailingSlash(base_location.path);
+    }
+    options.PageSizeHint = page_size_hint;
+    options.Include = Azure::Storage::Blobs::Models::ListBlobsIncludeFlags::Metadata;
+
+    // When Prefix.Value() contains a trailing slash and we find a blob that
+    // matches it completely, it is an empty directory marker blob for the
+    // directory we're listing from, and we should skip it.
+    auto is_empty_dir_marker =
+        [&options](const Azure::Storage::Blobs::Models::BlobItem& blob) noexcept -> bool {
+      return options.Prefix.HasValue() && blob.Name == options.Prefix.Value();
+    };
+
+    auto recurse = [&](const std::string& blob_prefix) noexcept -> Status {
+      if (select.recursive && select.max_recursion > 0) {
+        FileSelector sub_select;
+        sub_select.base_dir = base_location.container;
+        sub_select.base_dir += internal::kSep;
+        sub_select.base_dir += internal::RemoveTrailingSlash(blob_prefix);
+        sub_select.allow_not_found = true;
+        sub_select.recursive = true;
+        sub_select.max_recursion = select.max_recursion - 1;
+        return GetFileInfoWithSelectorFromContainer(
+            container_client, context, page_size_hint, sub_select, acc_results);
+      }
+      return Status::OK();
+    };
+
+    // (*acc_results)[*last_dir_reported] is the last FileType::Directory in the results
+    // produced through this loop over the response pages.
+    std::optional<size_t> last_dir_reported{};
+    auto matches_last_dir_reported = [&last_dir_reported,
+                                      acc_re
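The empty-directory-marker rule described in the quoted hunk can be illustrated in isolation: when listing a flat blob namespace with a prefix ending in `/`, a blob whose full name equals the prefix is just the marker for the directory itself and must be skipped. This is a hedged, simplified model of that logic, not the Azure SDK API:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Return the blobs under `prefix` (which ends with '/'), skipping the
// empty-directory marker blob whose name equals the prefix itself.
std::vector<std::string> ListUnderPrefix(const std::vector<std::string>& blobs,
                                         const std::string& prefix) {
  std::vector<std::string> out;
  for (const auto& name : blobs) {
    if (name.rfind(prefix, 0) != 0) continue;  // not under the prefix
    if (name == prefix) continue;              // empty-directory marker blob
    out.push_back(name);
  }
  return out;
}
```

The marker blob exists only so that an otherwise-empty "directory" is representable in a flat blob store.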
Re: [PR] GH-38597: [C++] Implement GetFileInfo(selector) for Azure filesystem [arrow]
felipecrv commented on code in PR #39009: URL: https://github.com/apache/arrow/pull/39009#discussion_r1414720550

## cpp/src/arrow/filesystem/azurefs.cc:

@@ -1103,7 +1330,12 @@ Result<FileInfo> AzureFileSystem::GetFileInfo(const std::string& path) {
 }

 Result<FileInfoVector> AzureFileSystem::GetFileInfo(const FileSelector& select) {
-  return Status::NotImplemented("The Azure FileSystem is not fully implemented");
+  Azure::Core::Context context;
+  Azure::Nullable<int32_t> page_size_hint;  // unspecified

Review Comment: I want all the wiring to be in place if we need to specify the `page_size_hint`, which I'm pretty sure we will need to tweak when this is used in practice.
Re: [PR] GH-38597: [C++] Implement GetFileInfo(selector) for Azure filesystem [arrow]
felipecrv commented on code in PR #39009: URL: https://github.com/apache/arrow/pull/39009#discussion_r1414718868

## cpp/src/arrow/filesystem/azurefs.cc:

@@ -1103,7 +1330,12 @@ Result<FileInfo> AzureFileSystem::GetFileInfo(const std::string& path) {
 }

 Result<FileInfoVector> AzureFileSystem::GetFileInfo(const FileSelector& select) {
-  return Status::NotImplemented("The Azure FileSystem is not fully implemented");
+  Azure::Core::Context context;
+  Azure::Nullable<int32_t> page_size_hint;  // unspecified
+  FileInfoVector results;
+  RETURN_NOT_OK(
+      impl_->GetFileInfoWithSelector(context, page_size_hint, select, &results));
+  return std::move(results);

Review Comment: This is because I want to move the `results` into the `Result<FileInfoVector>` that is being created. It's equivalent to this:

```cpp
Result<FileInfoVector> result{std::move(results)};
return result;
```

I will change the line to `return {std::move(results)};` to clarify what is going on.
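For readers unfamiliar with the idiom: returning `{std::move(local)}` move-constructs the local into the wrapper being returned instead of copying it. A self-contained sketch using a hypothetical `MiniResult` wrapper (not Arrow's `arrow::Result`, just enough to show the return style):

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical Result-like wrapper, only to illustrate the return idiom.
template <typename T>
struct MiniResult {
  T value;
  MiniResult(T v) : value(std::move(v)) {}
};

MiniResult<std::vector<std::string>> CollectNames() {
  std::vector<std::string> results;
  results.push_back("container/somefile");
  // Move the local vector into the implicitly-constructed MiniResult:
  return {std::move(results)};
}
```

The braced form makes it explicit that the wrapper, not the vector, is what the function returns.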
Re: [PR] GH-38597: [C++] Implement GetFileInfo(selector) for Azure filesystem [arrow]
felipecrv commented on code in PR #39009: URL: https://github.com/apache/arrow/pull/39009#discussion_r1414717355

## cpp/src/arrow/filesystem/path_util.cc:

@@ -72,15 +72,12 @@ std::string SliceAbstractPath(const std::string& s, int offset, int length, char
     return "";
   }
   std::vector<std::string> components = SplitAbstractPath(s, sep);
-  std::stringstream combined;
   if (offset >= static_cast<int>(components.size())) {
     return "";
   }
-  int end = offset + length;
-  if (end > static_cast<int>(components.size())) {
-    end = static_cast<int>(components.size());
-  }
-  for (int i = offset; i < end; i++) {
+  const size_t end = std::min(static_cast<size_t>(offset) + length, components.size());

Review Comment: Fixing.
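The diff replaces a manual if/else clamp with `std::min`. The same clamping pattern can be shown in a standalone form; `SlicePath` and its signature are illustrative stand-ins, not the actual `SliceAbstractPath`:

```cpp
#include <algorithm>
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Join components[offset, offset+length) with `sep`, clamping the end of the
// slice to the component count with std::min instead of a manual if/else.
std::string SlicePath(const std::vector<std::string>& components, int offset,
                      int length, char sep = '/') {
  if (offset < 0 || length < 0 ||
      offset >= static_cast<int>(components.size())) {
    return "";
  }
  const size_t end = std::min(
      static_cast<size_t>(offset) + static_cast<size_t>(length),
      components.size());
  std::stringstream combined;
  for (size_t i = static_cast<size_t>(offset); i < end; ++i) {
    if (i != static_cast<size_t>(offset)) combined << sep;
    combined << components[i];
  }
  return combined.str();
}
```

Doing the arithmetic in `size_t` after the bounds checks avoids the signed/unsigned comparison the casts in the original loop were working around.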
Re: [PR] GH-38705: [C++][FS][Azure] Implement CopyFile() [arrow]
kou commented on PR #39058: URL: https://github.com/apache/arrow/pull/39058#issuecomment-1839825528

> > I think there is a problem with the `CopyFromUri()` API though.
>
> Nevermind, it looks like you've got a working implementation using `CopyFromUri()`. Maybe it does auth automatically if it detects that the source is in the same storage account. Or possibly the Azure SDK automatically switches between [Copy Blob From URL](https://learn.microsoft.com/en-us/rest/api/storageservices/copy-blob-from-url?tabs=microsoft-entra-id) and [Copy Blob](https://learn.microsoft.com/en-us/rest/api/storageservices/copy-blob?tabs=microsoft-entra-id)

I think that "Copy Blob From URL" and "Copy Blob" are the same API internally. Both of them use the same URI and `PUT`:

* Copy Blob From URL: https://learn.microsoft.com/en-us/rest/api/storageservices/copy-blob-from-url#request
* Copy Blob: https://learn.microsoft.com/en-us/rest/api/storageservices/copy-blob#request

And `BlobClient` provides only `CopyFromUri()`. (It doesn't provide `Copy()`.) So I used `CopyFromUri()`.
Re: [PR] GH-38597: [C++] Implement GetFileInfo(selector) for Azure filesystem [arrow]
felipecrv commented on code in PR #39009: URL: https://github.com/apache/arrow/pull/39009#discussion_r1414716712

## cpp/src/arrow/filesystem/azurefs_test.cc:

@@ -479,6 +522,133 @@ TEST_F(AzureHierarchicalNamespaceFileSystemTest, GetFileInfoObject) {
   RunGetFileInfoObjectTest();
 }

+TEST_F(AzuriteFileSystemTest, GetFileInfoSelector) {
+  SetUpSmallFileSystemTree();
+
+  FileSelector select;
+  std::vector<FileInfo> infos;
+
+  // Root dir
+  select.base_dir = "";
+  ASSERT_OK_AND_ASSIGN(infos, fs_->GetFileInfo(select));
+  ASSERT_EQ(infos.size(), 3);
+  ASSERT_EQ(infos, SortedInfos(infos));
+  AssertFileInfo(infos[0], "container", FileType::Directory);
+  AssertFileInfo(infos[1], "empty-container", FileType::Directory);
+  AssertFileInfo(infos[2], container_name_, FileType::Directory);
+
+  // Empty container
+  select.base_dir = "empty-container";
+  ASSERT_OK_AND_ASSIGN(infos, fs_->GetFileInfo(select));
+  ASSERT_EQ(infos.size(), 0);
+  // Nonexistent container
+  select.base_dir = "nonexistent-container";
+  ASSERT_RAISES(IOError, fs_->GetFileInfo(select));
+  select.allow_not_found = true;
+  ASSERT_OK_AND_ASSIGN(infos, fs_->GetFileInfo(select));
+  ASSERT_EQ(infos.size(), 0);
+  select.allow_not_found = false;
+  // Non-empty container
+  select.base_dir = "container";
+  ASSERT_OK_AND_ASSIGN(infos, fs_->GetFileInfo(select));
+  ASSERT_EQ(infos, SortedInfos(infos));
+  ASSERT_EQ(infos.size(), 4);
+  AssertFileInfo(infos[0], "container/emptydir", FileType::Directory);
+  AssertFileInfo(infos[1], "container/otherdir", FileType::Directory);
+  AssertFileInfo(infos[2], "container/somedir", FileType::Directory);
+  AssertFileInfo(infos[3], "container/somefile", FileType::File, 9);
+
+  // Empty "directory"
+  select.base_dir = "container/emptydir";
+  ASSERT_OK_AND_ASSIGN(infos, fs_->GetFileInfo(select));
+  ASSERT_EQ(infos.size(), 0);
+  // Non-empty "directories"
+  select.base_dir = "container/somedir";
+  ASSERT_OK_AND_ASSIGN(infos, fs_->GetFileInfo(select));
+  ASSERT_EQ(infos.size(), 1);
+  AssertFileInfo(infos[0], "container/somedir/subdir", FileType::Directory);
+  select.base_dir = "container/somedir/subdir";
+  ASSERT_OK_AND_ASSIGN(infos, fs_->GetFileInfo(select));
+  ASSERT_EQ(infos.size(), 1);
+  AssertFileInfo(infos[0], "container/somedir/subdir/subfile", FileType::File, 8);
+  // Nonexistent
+  select.base_dir = "container/nonexistent";
+  ASSERT_RAISES(IOError, fs_->GetFileInfo(select));
+  select.allow_not_found = true;
+  ASSERT_OK_AND_ASSIGN(infos, fs_->GetFileInfo(select));
+  ASSERT_EQ(infos.size(), 0);
+  select.allow_not_found = false;
+
+  // Trailing slashes
+  select.base_dir = "empty-container/";
+  ASSERT_OK_AND_ASSIGN(infos, fs_->GetFileInfo(select));
+  ASSERT_EQ(infos.size(), 0);
+  select.base_dir = "nonexistent-container/";
+  ASSERT_RAISES(IOError, fs_->GetFileInfo(select));
+  select.base_dir = "container/";
+  ASSERT_OK_AND_ASSIGN(infos, fs_->GetFileInfo(select));
+  ASSERT_EQ(infos, SortedInfos(infos));
+  ASSERT_EQ(infos.size(), 4);
+}
+
+TEST_F(AzuriteFileSystemTest, GetFileInfoSelectorRecursive) {
+  SetUpSmallFileSystemTree();
+
+  FileSelector select;
+  select.recursive = true;
+
+  std::vector<FileInfo> infos;
+  // Root dir
+  select.base_dir = "";
+  ASSERT_OK_AND_ASSIGN(infos, fs_->GetFileInfo(select));
+  ASSERT_EQ(infos.size(), 14);

Review Comment: I know. This is a very cheap check I added to ensure that simple mistakes in the code don't immediately lead you to the more confusing error messages produced by the code that follows it:

```cpp
ASSERT_EQ(infos, SortedInfos(infos));
AssertInfoAllContainersRecursive(infos);
```

I don't think checking the length 3 times costs anything significant.
Re: [I] SQL: Ambiguous reference when aliasing in combination with ORDER BY [arrow-datafusion]
Asura7969 commented on issue #8391: URL: https://github.com/apache/arrow-datafusion/issues/8391#issuecomment-1839821571

Do you have any plans to fix this bug? If you don't have time, I'll follow up @Jesse-Bakker
Re: [PR] GH-38597: [C++] Implement GetFileInfo(selector) for Azure filesystem [arrow]
felipecrv commented on code in PR #39009: URL: https://github.com/apache/arrow/pull/39009#discussion_r1414712577

## cpp/src/arrow/filesystem/azurefs_test.cc:

+  AssertFileInfo(infos[0], "container/emptydir", FileType::Directory);
+  AssertFileInfo(infos[1], "container/otherdir", FileType::Directory);
+  AssertFileInfo(infos[2], "container/somedir", FileType::Directory);
+  AssertFileInfo(infos[3], "container/somefile", FileType::File, 9);

Review Comment: Single-use constants are very common in tests. I don't see how putting these in a constant to be used only once (an unnecessary indirection) can help the person debugging these unit tests in case they start failing. Do you have name suggestions for the constants?
Re: [PR] GH-37884: [Swift] allow reading of unaligned FlatBuffers buffers [arrow]
abandy commented on PR #38635: URL: https://github.com/apache/arrow/pull/38635#issuecomment-1839817261

@kou Please review when you get a chance.
Re: [PR] GH-38597: [C++] Implement GetFileInfo(selector) for Azure filesystem [arrow]
felipecrv commented on code in PR #39009: URL: https://github.com/apache/arrow/pull/39009#discussion_r1414708333

## cpp/src/arrow/filesystem/azurefs_test.cc:

@@ -287,6 +287,49 @@ class AzureFileSystemTest : public ::testing::Test {
   void RunGetFileInfoObjectWithNestedStructureTest();
   void RunGetFileInfoObjectTest();
+
+  void SetUpSmallFileSystemTree() {
+    // Set up test containers
+    blob_service_client_->GetBlobContainerClient("empty-container").CreateIfNotExists();
+
+    auto container_client = blob_service_client_->GetBlobContainerClient("container");
+    container_client.CreateIfNotExists();
+
+    auto blob_client = container_client.GetBlockBlobClient("emptydir/");
+    blob_client.UploadFrom(reinterpret_cast<const uint8_t*>(""), 0);
+
+    blob_client = container_client.GetBlockBlobClient("somedir/subdir/subfile");
+    const char* sub_data = "sub data";
+    blob_client.UploadFrom(reinterpret_cast<const uint8_t*>(sub_data), strlen(sub_data));
+
+    blob_client = container_client.GetBlockBlobClient("somefile");
+    const char* some_data = "some data";
+    blob_client.UploadFrom(reinterpret_cast<const uint8_t*>(some_data),
+                           strlen(some_data));
+
+    blob_client = container_client.GetBlockBlobClient("otherdir/1/2/3/otherfile");
+    const char* other_data = "other data";
+    blob_client.UploadFrom(reinterpret_cast<const uint8_t*>(other_data),
+                           strlen(other_data));

Review Comment: `fs_` is the abstraction on top of the blob client. If I use the abstraction to produce test data, bugs could cancel each other out. Even when that is not the case, unit test failures are much harder to debug if you use the code being tested to also produce the ground-truth test data you use to test the code. The s3fs tests use the S3 SDK directly to produce test data, and I suppose it's for these reasons.
Re: [PR] GH-38597: [C++] Implement GetFileInfo(selector) for Azure filesystem [arrow]
felipecrv commented on code in PR #39009: URL: https://github.com/apache/arrow/pull/39009#discussion_r1414706522

## cpp/src/arrow/filesystem/azurefs_test.cc:

@@ -213,7 +213,7 @@ class AzureFileSystemTest : public ::testing::Test {
     suite_skipped_ = true;
     GTEST_SKIP() << options.status().message();
   }
-  container_name_ = RandomChars(32);
+  container_name_ = "z" + RandomChars(31);

Review Comment: The `Preexisting*` paths are added automatically by `SetUp`. I want the container name to start with `z` so that all paths sort deterministically when other paths are added, which is guaranteed because the paths I add start with letters before `z`. The right solution here is to have every test populate the filesystem explicitly and make `SetUp` simpler, but I don't feel like doing it now when so many people are adding and changing the tests in the same file.
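The effect of the `z` prefix is plain lexicographic ordering: any name starting with `z` sorts after names starting with earlier letters, so the randomly named fixture container always lands at the end of a sorted listing. A tiny illustration (the concrete names below are made up, not the actual generated ones):

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Sort a listing the way GetFileInfo(selector) results are compared in the
// tests, and return the last entry.
std::string LastInListing(std::vector<std::string> names) {
  std::sort(names.begin(), names.end());
  return names.back();
}
```

Because `'z'` compares greater than the first letter of every other test container name, the fixture container's position in sorted output is deterministic regardless of the random suffix.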
Re: [PR] GH-38597: [C++] Implement GetFileInfo(selector) for Azure filesystem [arrow]
felipecrv commented on code in PR #39009: URL: https://github.com/apache/arrow/pull/39009#discussion_r1414704929

## cpp/src/arrow/filesystem/azurefs.cc:
Re: [PR] GH-15060: [JS] Add LargeUtf8 type [arrow]
domoritz commented on PR #35780: URL: https://github.com/apache/arrow/pull/35780#issuecomment-1839808240

Is it actually an issue that chunks have different lengths? Seems to work fine here

```js
const b1 = makeBuilder({ type: new Int32 });
const b2 = makeBuilder({ type: new Int32 });

const d1 = [] as Data[];
const d2 = [] as Data[];

b1.append(1);
b1.append(2);
d1.push(b1.flush());
b1.append(3);
d1.push(b1.flush());

b2.append(4);
d2.push(b2.flush());
b2.append(5);
b2.append(6);
d2.push(b2.flush());

const v1 = new Vector(d1);
const v2 = new Vector(d2);

const table = new Table({ v1, v2 });
console.log(table.toArray());
```
Re: [I] [C++][FS][Azure] Should `GetFileInfo()` against a directory always return true without hierarchical namespace support? [arrow]
kou commented on issue #38772: URL: https://github.com/apache/arrow/issues/38772#issuecomment-1839807333

@felipecrv @Tom-Newton Do either of you want to work on this?
Re: [PR] GH-38705: [C++][FS][Azure] Implement CopyFile() [arrow]
kou commented on PR #39058: URL: https://github.com/apache/arrow/pull/39058#issuecomment-1839806315

Oh, sorry. I didn't read the docstring. I'll remove the behavior.

> Also, at some point the Azure fs implementation will have to implement and pass the [generic filesystem tests](https://github.com/apache/arrow/blob/f7947cc21bf78d67cf5ac1bf1894b5e04de1a632/cpp/src/arrow/filesystem/test_util.h#L120-L245), which do have a test [for this situation](https://github.com/apache/arrow/blob/f7947cc21bf78d67cf5ac1bf1894b5e04de1a632/cpp/src/arrow/filesystem/test_util.cc#L555-L558).

I've opened a new issue for it: #39069
Re: [PR] GH-38597: [C++] Implement GetFileInfo(selector) for Azure filesystem [arrow]
felipecrv commented on code in PR #39009: URL: https://github.com/apache/arrow/pull/39009#discussion_r1414700135

## cpp/src/arrow/filesystem/azurefs.cc:
Re: [PR] GH-38597: [C++] Implement GetFileInfo(selector) for Azure filesystem [arrow]
felipecrv commented on code in PR #39009: URL: https://github.com/apache/arrow/pull/39009#discussion_r1414699614

## cpp/src/arrow/filesystem/azurefs.cc: ##

[... same quoted context as the previous comment ...]
Re: [PR] GH-38597: [C++] Implement GetFileInfo(selector) for Azure filesystem [arrow]
felipecrv commented on code in PR #39009: URL: https://github.com/apache/arrow/pull/39009#discussion_r1414698838

## cpp/src/arrow/filesystem/azurefs.cc: ##

[... same quoted context as the previous comment, down to ...]

+    if (internal::GetAbstractPathDepth(base_location.path) == 0) {
+      // If the base_dir is the root of the container, then we want to list all blobs in
+      // the container and the Prefix should be empty and not even include the trailing
+      // slash because the container itself represents the `/` directory.
+      options.Prefix = {};

Review Comment: The compiler can elide it, so its reason for existing is to anchor the comment.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] GH-38597: [C++] Implement GetFileInfo(selector) for Azure filesystem [arrow]
felipecrv commented on code in PR #39009: URL: https://github.com/apache/arrow/pull/39009#discussion_r1414697321

## cpp/src/arrow/filesystem/azurefs.cc: ##

[... same quoted context as the previous comment, down to ...]

+  static std::string_view BasenameView(std::string_view s) {
+    auto offset = s.find_last_of(internal::kSep);
+    auto tail = (offset == std::string_view::npos) ? s : s.substr(offset);
+    return internal::RemoveTrailingSlash(tail, /*preserve_root=*/false);
+  }

Review Comment: I'm changing it to this:

```cpp
static std::string_view BasenameView(std::string_view s) {
  DCHECK(!internal::HasTrailingSlash(s));
  auto offset = s.find_last_of(internal::kSep);
  auto result = (offset == std::string_view::npos) ? s : s.substr(offset);
  DCHECK(!result.empty() && result.back() != internal::kSep);
  return result;
}
```

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] GH-38597: [C++] Implement GetFileInfo(selector) for Azure filesystem [arrow]
felipecrv commented on code in PR #39009: URL: https://github.com/apache/arrow/pull/39009#discussion_r1414680091

## cpp/src/arrow/filesystem/azurefs.cc: ##

[... same quoted context as the previous comment, down to ...]

+  static std::string_view BasenameView(std::string_view s) {
+    auto offset = s.find_last_of(internal::kSep);
+    auto tail = (offset == std::string_view::npos) ? s : s.substr(offset);
+    return internal::RemoveTrailingSlash(tail, /*preserve_root=*/false);
+  }

Review Comment: The problem with `path_util.h` functions is that they don't follow any concept of path normalization for inputs and have to check the input for many variations. This `BasenameView` works well in this context, but not in a general one.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] GH-38597: [C++] Implement GetFileInfo(selector) for Azure filesystem [arrow]
felipecrv commented on code in PR #39009: URL: https://github.com/apache/arrow/pull/39009#discussion_r1414675638

## cpp/src/arrow/filesystem/azurefs.cc: ##

[... same quoted context as the previous comment, down to ...]

+    std::string path;
+    path.reserve(container.size() + 1 + blob.Name.size());
+    path += container;
+    path += internal::kSep;
+    path += blob.Name;

Review Comment: My goal here was to not rely on the semantics of the path_util utilities, which can be very confusing with all the checks and indirections, but I will use `ConcatAbstractPath` now.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[PR] fix: Literal in window definition should not an ordinal refering to relation column [arrow-datafusion]
viirya opened a new pull request, #8419: URL: https://github.com/apache/arrow-datafusion/pull/8419

## Which issue does this PR close?

Closes #.

## Rationale for this change

This continues to fix one bug found in #8410. Currently, a literal value in the `ORDER BY` clause of a window definition is treated as an ordinal reference to a column in the relation, which I think is the behavior of sort expressions under the `ORDER BY` keyword. However, for a window definition, Postgres simply treats a literal value as a literal:

```
postgres=# select a, rank() over (order by 1 RANGE between UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) rnk from (select 1 a union select 2 a) q;
 a | rnk
---+-----
 1 |   1
 2 |   1
(2 rows)

postgres=# select a, rank() over (order by 1 RANGE between 1 PRECEDING AND 1 FOLLOWING) rnk from (select 1 a union select 2 a) q;
 a | rnk
---+-----
 1 |   1
 2 |   1
(2 rows)
```

## What changes are included in this PR?

Treat a literal value in the `ORDER BY` clause of a window definition as a literal, instead of as a column ordinal.

## Are these changes tested?

Modified test in `sqllogictest`.

## Are there any user-facing changes?

A literal in the `ORDER BY` clause of a window definition is now treated as a literal instead of as a column ordinal.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] GH-38597: [C++] Implement GetFileInfo(selector) for Azure filesystem [arrow]
felipecrv commented on code in PR #39009: URL: https://github.com/apache/arrow/pull/39009#discussion_r1414655422

## cpp/src/arrow/filesystem/azurefs.cc: ##

[... same quoted context as the previous comment, down to ...]

+    Azure::Storage::Blobs::ListBlobContainersOptions options;
+    // Deleted containers are not returned.
+    options.Include = Azure::Storage::Blobs::Models::ListBlobContainersIncludeFlags::None;

Review Comment: I wanted to leave a placeholder for extension later (if needed), but I will delete these two lines.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] MINOR: [CI] Bump actions/setup-dotnet from 3 to 4 [arrow]
conbench-apache-arrow[bot] commented on PR #39061: URL: https://github.com/apache/arrow/pull/39061#issuecomment-1839737398 After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 87fbf8990671d254ca77c36c99f24e773c6fdafd. There were no benchmark performance regressions. 🎉 The [full Conbench report](https://github.com/apache/arrow/runs/19308528838) has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] Update code comment for the cases of regularized RANGE frame and add tests for ORDER BY cases with RANGE frame [arrow-datafusion]
viirya merged PR #8410: URL: https://github.com/apache/arrow-datafusion/pull/8410 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] Update code comment for the cases of regularized RANGE frame and add tests for ORDER BY cases with RANGE frame [arrow-datafusion]
viirya commented on PR #8410: URL: https://github.com/apache/arrow-datafusion/pull/8410#issuecomment-1839713617 Thanks @jackwener -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] Minor: Add installation link to README.md [arrow-datafusion]
comphead merged PR #8389: URL: https://github.com/apache/arrow-datafusion/pull/8389 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] Make it easier to install / get started with DataFusion [arrow-datafusion]
comphead closed issue #7297: Make it easier to install / get started with DataFusion URL: https://github.com/apache/arrow-datafusion/issues/7297 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] enable field-order-agnostic overloads of `fromarrow` for struct types [arrow-julia]
jrevels commented on PR #493: URL: https://github.com/apache/arrow-julia/pull/493#issuecomment-1839654384

Alright, I changed the approach here after profiling different candidate approaches. Looks like leaving `NamedTuple` construction (or even just any intermediate struct construction) in the critical path here borks `getindex` performance, annoyingly, so I ended up not going that route. Users of this feature can still construct `NamedTuple`s themselves if needed, but this new approach avoids doing so on the default path.

on this PR now:

```jl
BenchmarkTools.Trial: 8170 samples with 1 evaluation.
 Range (min … max):  530.500 μs …   2.318 ms  ┊ GC (min … max): 0.00% … 74.22%
 Time  (median):     547.833 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   610.966 μs ± 263.948 μs  ┊ GC (mean ± σ):  9.75% ± 14.73%

  █▆▃▁ ▁▃▁ ▁
  ▇▇▅▃▅▁▁▃▁▄███ █
  530 μs        Histogram: log(frequency) by time        1.75 ms <

 Memory estimate: 2.22 MiB, allocs estimate: 18387.
```

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] Consolidate statistics aggregation [arrow-datafusion]
alamb commented on issue #8229: URL: https://github.com/apache/arrow-datafusion/issues/8229#issuecomment-1839645692 https://github.com/apache/arrow-datafusion/pull/8254 has the basics in place, but I need to work on other higher priority items this week and next so this has been pushed lower in my priority list -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] Support AVRO Format for Write Queries [arrow-datafusion]
alamb commented on issue #7679: URL: https://github.com/apache/arrow-datafusion/issues/7679#issuecomment-1839611958 > @alamb Hi, I can help with this ticket : ) Thanks @Veeupup -- for this one the first thing we need to do is get an avro writer and the second part is hooking it into DataFusion. There is some work upstream in arrow-rs to make an avro reader/writer in https://github.com/apache/arrow-rs/issues/4886 but I think that project is stalled at the moment -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[PR] Easy way to zero-copy IPC buffers. [arrow-rs]
comath opened a new pull request, #5165: URL: https://github.com/apache/arrow-rs/pull/5165 # Which issue does this PR close? Closes #5153 # Rationale for this change There are ways to memmap buffers, if you know the header information, but no easy way to use the existing IPC file format in a zero-copy way. # What changes are included in this PR? Reused the IPC reader code to read the header and footer information from an IPC file, but swapped out the read/copy of the buffer for the Buffer creation with `Buffer::from_custom_allocation`. Minor savings can be had by not allocating the arbitrary-sized footer and message read, but the easy way to do this requires a nightly API in the Cursor object. This duplicated some code, but it's messy to deduplicate because most of it is boilerplate around reading the flatbuffer messages. The better way to deduplicate is to change the `FileReader` to optionally accept a buffer and conditionally use that in the 2 locations where buffers are actually read. Perfectly happy to do it this way as well. # Are there any user-facing changes? There's a new struct that mirrors the current IPC reader struct that passes back zero-copy buffers. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [R] nanoarrow as an interchange class? [arrow-nanoarrow]
paleolimbot commented on issue #331: URL: https://github.com/apache/arrow-nanoarrow/issues/331#issuecomment-1839578943 I *think* that is a good idea, although I might need a more specific example to more concretely comment. Today you might have to use `as_nanoarrow_array()` as a backup. It may not be quite the same, but in the `geos` and `s2` and `geoarrow` packages I fall back on `wk_handle()`'s generic (which avoids most of those packages having to know about any of the others). I'd like to move those to start using `as_nanoarrow_array_stream()` in a way like you described. In general, the pattern I envisioned for interchange is: - Producers that produce objects that can be interpreted as arrays implement an `as_nanoarrow_array()` S3 method (e.g., `arrow::Array` has a method for `as_nanoarrow_array()`) - Consumers that accept arrays call `as_nanoarrow_array()` to sanitize the input. The `as_nanoarrow_array_stream()` version is slightly more generic (e.g., can accommodate streaming results and/or Chunked arrays), but isn't *quite* where it should be (e.g., if you call `as_nanoarrow_array_stream()` on an `arrow::ChunkedArray` today, I think you would get an error, even though it *should* work). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[PR] Minor: Add example with parameters to LogicalPlan [arrow-datafusion]
alamb opened a new pull request, #8418: URL: https://github.com/apache/arrow-datafusion/pull/8418 ## Which issue does this PR close? Follow on to https://github.com/apache/arrow-datafusion/pull/8384 ## Rationale for this change Add an example in the Doc so it is slightly easier to figure out how to use named query parameter ## What changes are included in this PR? Extra doc example ## Are these changes tested? Yes, doc test ## Are there any user-facing changes? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[PR] chore: some debug info [arrow-datafusion]
NGA-TRAN opened a new pull request, #8417: URL: https://github.com/apache/arrow-datafusion/pull/8417 ## Which issue does this PR close? Closes #. ## Rationale for this change ## What changes are included in this PR? ## Are these changes tested? ## Are there any user-facing changes? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] Add Projection to FilterExec [arrow-datafusion]
Dandandan closed pull request #8416: Add Projection to FilterExec URL: https://github.com/apache/arrow-datafusion/pull/8416 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] Add Projection to FilterExec [arrow-datafusion]
Dandandan commented on PR #8416: URL: https://github.com/apache/arrow-datafusion/pull/8416#issuecomment-1839535102 This approach is not going to work at this time -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] GH-15058: [C++][Python] Native support for UUID [arrow]
rok commented on PR #37298: URL: https://github.com/apache/arrow/pull/37298#issuecomment-1839526314 @arogozhnikov I think it was mostly about lack of reviews :) I've rebased, let's see what current problems are. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] Support named query parameters [arrow-datafusion]
alamb merged PR #8384: URL: https://github.com/apache/arrow-datafusion/pull/8384 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] Support named query parameters [arrow-datafusion]
alamb closed issue #8245: Support named query parameters URL: https://github.com/apache/arrow-datafusion/issues/8245 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] Support named query parameters [arrow-datafusion]
alamb commented on PR #8384: URL: https://github.com/apache/arrow-datafusion/pull/8384#issuecomment-1839525992 Thanks @Asura7969 ! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] Make filter selectivity for statistics configurable [arrow-datafusion]
alamb commented on code in PR #8243: URL: https://github.com/apache/arrow-datafusion/pull/8243#discussion_r1414535729

## datafusion/physical-plan/src/filter.rs:
```diff
@@ -994,4 +1014,22 @@ mod tests {
         Ok(())
     }
+
+    #[tokio::test]
+    async fn test_validation_filter_selectivity() -> Result<()> {
```

Review Comment: > Not sure however how [this](https://github.com/apache/arrow-datafusion/blob/main/datafusion/physical-plan/src/filter.rs#L221) should be handled

That code will be invoked for 'complicated' predicates -- maybe we could fake it with something like `sin(x) = 4.0`.

## datafusion/physical-plan/src/filter.rs:
```diff
@@ -994,4 +1014,54 @@ mod tests {
         Ok(())
     }
+
+    #[tokio::test]
+    async fn test_validation_filter_selectivity() -> Result<()> {
+        let schema = Schema::new(vec![Field::new("a", DataType::Int32, false)]);
+        let input = Arc::new(StatisticsExec::new(
+            Statistics::new_unknown(&schema),
+            schema,
+        ));
+        // WHERE a = 10
+        let predicate = Arc::new(BinaryExpr::new(
+            Arc::new(Column::new("a", 0)),
+            Operator::Eq,
+            Arc::new(Literal::new(ScalarValue::Int32(Some(10)))),
+        ));
+        let filter = FilterExec::try_new(predicate, input)?;
+        assert!(filter.with_default_selectivity(120).is_err());
+        Ok(())
+    }
+
+    #[tokio::test]
+    async fn test_custom_filter_selectivity() -> Result<()> {
+        // Need a decimal to trigger inexact selectivity
+        let schema =
+            Schema::new(vec![Field::new("a", DataType::Decimal128(2, 3), false)]);
+        let input = Arc::new(StatisticsExec::new(
+            Statistics {
+                num_rows: Precision::Inexact(1000),
+                total_byte_size: Precision::Inexact(4000),
+                column_statistics: vec![ColumnStatistics {
+                    ..Default::default()
+                }],
+            },
+            schema,
+        ));
+        // WHERE a = 10
+        let predicate = Arc::new(BinaryExpr::new(
+            Arc::new(Column::new("a", 0)),
+            Operator::Eq,
+            Arc::new(Literal::new(ScalarValue::Decimal128(Some(10), 10, 10))),
+        ));
+        let filter = FilterExec::try_new(predicate, input)?;
+        let statistics = filter.statistics()?;
+        assert_eq!(statistics.num_rows, Precision::Inexact(200));
```

Review Comment: 👌 very nice

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
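The two tests quoted above exercise two rules: a default selectivity is a percentage that must lie in 0..=100 (so `with_default_selectivity(120)` errors), and an inexact input row count is scaled by that percentage (1000 rows at the 20% expectation become `Inexact(200)`). A minimal sketch of that arithmetic, not DataFusion's actual code:

```python
# Hypothetical sketch of the selectivity behavior the tests above assert:
# a percentage in 0..=100 scales an inexact row-count estimate.

def apply_selectivity(num_rows: int, selectivity_pct: int) -> int:
    """Scale a row-count estimate by a selectivity given in percent."""
    if not 0 <= selectivity_pct <= 100:
        # Mirrors with_default_selectivity(120) returning an error above.
        raise ValueError(f"selectivity must be in 0..=100, got {selectivity_pct}")
    return num_rows * selectivity_pct // 100
```

With 1000 input rows, a selectivity of 20 yields the 200-row estimate the second test checks.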
Re: [PR] MINOR: [CI] Bump actions/setup-dotnet from 3 to 4 [arrow]
kou merged PR #39061: URL: https://github.com/apache/arrow/pull/39061 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] GH-38007: [C++][Python] Add VariableShapeTensor implementation [arrow]
jorisvandenbossche commented on code in PR #38008: URL: https://github.com/apache/arrow/pull/38008#discussion_r1414528595

## python/pyarrow/array.pxi:
```diff
@@ -3586,6 +3586,156 @@ class FixedShapeTensorArray(ExtensionArray):
         )
 
 
+cdef class VariableShapeTensorArray(ExtensionArray):
+    """
+    Concrete class for variable shape tensor extension arrays.
+
+    Examples
+    --------
+    Define the extension type for tensor array
+
+    >>> import pyarrow as pa
+    >>> tensor_type = pa.variable_shape_tensor(pa.int32(), 2)
+
+    Create an extension array
+
+    >>> shapes = pa.array([[2, 3], [1, 2]], pa.list_(pa.uint32(), 2))
+    >>> values = pa.array([[1, 2, 3, 4, 5, 6], [7, 8]], pa.list_(pa.int32()))
+    >>> arr = pa.StructArray.from_arrays([shapes, values], names=["shape", "data"])
+    >>> pa.ExtensionArray.from_storage(tensor_type, arr)
+    -- is_valid: all not null
+    -- child 0 type: fixed_size_list[2]
+      [
+        [
+          2,
+          3
+        ],
+        [
+          1,
+          2
+        ]
+      ]
+    -- child 1 type: list
+      [
+        [
+          1,
+          2,
+          3,
+          4,
+          5,
+          6
+        ],
+        [
+          7,
+          8
+        ]
+      ]
+    """
+
+    def to_numpy_ndarray(self):
```

Review Comment: BTW this method also has a different return type than the `FixedShapeTensorArray.to_numpy_ndarray` we added, so that's another reason to use a different name (or just remove it). For `FixedShapeTensorArray` the idea was that we wanted to be able to return a single nD array, which is possible for fixed shape, but not for variable shape.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] GH-38007: [C++][Python] Add VariableShapeTensor implementation [arrow]
rok commented on code in PR #38008: URL: https://github.com/apache/arrow/pull/38008#discussion_r1414521731

## python/pyarrow/array.pxi:
```diff
@@ -3586,6 +3586,156 @@ class FixedShapeTensorArray(ExtensionArray):
         )
 
 
+cdef class VariableShapeTensorArray(ExtensionArray):
+    """
+    Concrete class for variable shape tensor extension arrays.
+
+    Examples
+    --------
+    Define the extension type for tensor array
+
+    >>> import pyarrow as pa
+    >>> tensor_type = pa.variable_shape_tensor(pa.int32(), 2)
+
+    Create an extension array
+
+    >>> shapes = pa.array([[2, 3], [1, 2]], pa.list_(pa.uint32(), 2))
+    >>> values = pa.array([[1, 2, 3, 4, 5, 6], [7, 8]], pa.list_(pa.int32()))
+    >>> arr = pa.StructArray.from_arrays([shapes, values], names=["shape", "data"])
+    >>> pa.ExtensionArray.from_storage(tensor_type, arr)
+    -- is_valid: all not null
+    -- child 0 type: fixed_size_list[2]
+      [
+        [
+          2,
+          3
+        ],
+        [
+          1,
+          2
+        ]
+      ]
+    -- child 1 type: list
+      [
+        [
+          1,
+          2,
+          3,
+          4,
+          5,
+          6
+        ],
+        [
+          7,
+          8
+        ]
+      ]
+    """
+
+    def to_numpy_ndarray(self):
```

Review Comment: Or maybe it's better to remove this for now.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] feat(r): Provide LinkingTo headers for extension packages [arrow-nanoarrow]
eddelbuettel commented on PR #332: URL: https://github.com/apache/arrow-nanoarrow/pull/332#issuecomment-1839499034

We are making use of Arrow exports from C++ using just the `void*` pointer. What we have was written before / without `nanoarrow`, but I have started to bring it onto the R bindings side. There are a number of things there I'd like to revisit / extend. From my looking around at some of the packages I see using `nanoarrow` (including `adbc*` and `duckdb`), I have the feeling that there are commonalities: those packages may benefit too.

I quite like `nanoarrow` but (if I may) also find it really frustrating at times, because it essentially is three somewhat distinct and non-overlapping parts. First, the vendorable header / C file pair, which are great. Then the R package, which is nice for some things from R but which, frustratingly, does not (did not?) make any tools available for re-use in conjunction with the vendorable part. And then there is the budding Python package, which may one day cover what the R package does today and also provide object-creation help.

Now this PR looks promising. I will look more closely at it and then release 0.4.0. This clearly seems to move in the right direction, so three cheers for that! If you want to brainstorm / chat for a few minutes, I am sure we can find some time around the two hours of timezone difference.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] GH-38007: [C++][Python] Add VariableShapeTensor implementation [arrow]
rok commented on code in PR #38008: URL: https://github.com/apache/arrow/pull/38008#discussion_r1414513505

## python/pyarrow/array.pxi:
```diff
@@ -3586,6 +3586,156 @@ class FixedShapeTensorArray(ExtensionArray):
         )
 
 
+cdef class VariableShapeTensorArray(ExtensionArray):
+    """
+    Concrete class for variable shape tensor extension arrays.
+
+    Examples
+    --------
+    Define the extension type for tensor array
+
+    >>> import pyarrow as pa
+    >>> tensor_type = pa.variable_shape_tensor(pa.int32(), 2)
+
+    Create an extension array
+
+    >>> shapes = pa.array([[2, 3], [1, 2]], pa.list_(pa.uint32(), 2))
+    >>> values = pa.array([[1, 2, 3, 4, 5, 6], [7, 8]], pa.list_(pa.int32()))
+    >>> arr = pa.StructArray.from_arrays([shapes, values], names=["shape", "data"])
+    >>> pa.ExtensionArray.from_storage(tensor_type, arr)
+    -- is_valid: all not null
+    -- child 0 type: fixed_size_list[2]
+      [
+        [
+          2,
+          3
+        ],
+        [
+          1,
+          2
+        ]
+      ]
+    -- child 1 type: list
+      [
+        [
+          1,
+          2,
+          3,
+          4,
+          5,
+          6
+        ],
+        [
+          7,
+          8
+        ]
+      ]
+    """
+
+    def to_numpy_ndarray(self):
```

Review Comment: Oh, I assumed array of arrays was not technically possible. Will change.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] GH-38998: [Java] Build memory-core and memory-unsafe as JPMS modules [arrow]
jduo commented on PR #39011: URL: https://github.com/apache/arrow/pull/39011#issuecomment-1839493249

It seems these changes are not usable without modularizing memory-core. arrow-vector makes heavy use of Preconditions and BufferAllocator. Since these are declared in an unnamed module (memory-core), the user has to add `--add-reads=org.apache.arrow.vector=ALL-UNNAMED`. This doesn't fail when running arrow-vector tests, because Surefire adds this automatically for this module, but it fails any other test that depends on arrow-vector.

The Preconditions class could be moved to a new utils module, but that doesn't deal with BufferAllocator, which is very much part of memory-core out of necessity.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
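The situation jduo describes can be sketched as a module descriptor for memory-core. Everything below is illustrative: the module name, the exported package names, and the flag spelling are assumptions, not the actual Arrow build configuration.

```java
// Hypothetical module-info.java for a modularized memory-core. Until such a
// descriptor exists, memory-core classes live in the unnamed module, and a
// modular consumer of arrow-vector needs an extra launcher flag along the
// lines of:
//
//     java --add-reads org.apache.arrow.vector=ALL-UNNAMED ...
//
module org.apache.arrow.memory.core {
    exports org.apache.arrow.memory; // BufferAllocator (assumed package)
    exports org.apache.arrow.util;   // Preconditions (assumed package)
}
```

With a real descriptor in place, `requires org.apache.arrow.memory.core;` in downstream modules would replace the `ALL-UNNAMED` workaround.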
Re: [I] [C++] Static pkg-config file needs `-framework Security` on MacOS [arrow]
assignUser commented on issue #38861: URL: https://github.com/apache/arrow/issues/38861#issuecomment-1839486823 @jeroen the fix will be in 14.0.2 :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] GH-38997: [Java] Modularize format and vector and add plugin for compiling module-info files with JDK8 builds [arrow]
lidavidm commented on PR #38995: URL: https://github.com/apache/arrow/pull/38995#issuecomment-1839475501 This is on my backlog - I'm hoping to give this a review this week -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] Consider merging static libarrow builds into universal binaries [arrow]
assignUser commented on issue #39068: URL: https://github.com/apache/arrow/issues/39068#issuecomment-1839469637 Yeah makes sense, we do something similar with windows and ucrt/non-ucrt after all. IIRC we use universal binaries for the pyarrow wheels as well. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[PR] feat(r): Provide LinkingTo headers for extension packages [arrow-nanoarrow]
paleolimbot opened a new pull request, #332: URL: https://github.com/apache/arrow-nanoarrow/pull/332

In developing geoarrow, an R extension that imports/exports a number of Arrow C data structures wrapped by nanoarrow S3 objects, it has become apparent that the sanitize and allocate operations are non-trivial and basically have to be copied in every package that wants to import/export nanoarrow things. @eddelbuettel has run up against some valid use-cases as well! https://github.com/eddelbuettel/linesplitter/issues/1 , https://github.com/apache/arrow-nanoarrow/issues/187

This PR makes the definition of how Arrow C Data interface objects are encoded as R external pointers available as a public header (such that downstream packages can use `LinkingTo: nanoarrow` and `#include` the header). I think the initial targets will just be allocating an owning external pointer and sanitizing an input SEXP.

@eddelbuettel Are there any other operations that are blocking any of your projects that would be must-haves in this header? (I know it's missing array_stream... I forgot about the ability to supply R finalizers, and so it's a slightly more complicated change.)

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
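The external pointers this PR standardizes ultimately wrap Arrow C Data Interface structs, and "sanitize" largely means guarding against use-after-release and double-release. The sketch below uses the `ArrowArray` layout from the published C Data Interface specification; the two helper functions and the no-op release callback are hypothetical illustrations, not nanoarrow's actual API:

```c
#include <stdint.h>
#include <stddef.h>

/* Struct layout taken from the Arrow C Data Interface specification. */
struct ArrowArray {
  int64_t length;
  int64_t null_count;
  int64_t offset;
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;
  struct ArrowArray** children;
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);
  void* private_data;
};

/* Hypothetical sanitize check: the spec marks a released (or merely
 * zero-initialized) array with release == NULL; consumers must not touch
 * any other member in that state. */
static int array_is_valid(const struct ArrowArray* array) {
  return array != NULL && array->release != NULL;
}

/* Hypothetical helper: call release at most once. A conforming producer's
 * release callback sets array->release = NULL itself. */
static void array_release_safe(struct ArrowArray* array) {
  if (array_is_valid(array)) {
    array->release(array);
  }
}

/* Minimal conforming release callback for an array owning no buffers:
 * it only has to mark the struct as released. */
static void noop_release(struct ArrowArray* array) {
  array->release = NULL;
}
```

In the R binding, an owning external pointer's finalizer would call something like `array_release_safe` so that garbage collection never releases an already-released array.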
Re: [PR] GH-37359: [C#] Add ToList() to Decimal128Array and Decimal256Array [arrow]
conbench-apache-arrow[bot] commented on PR #37383: URL: https://github.com/apache/arrow/pull/37383#issuecomment-1839459596 After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit d357d2d8e2a7210a7e9c1833dfe11821f3f0d349. There were no benchmark performance regressions. 🎉 The [full Conbench report](https://github.com/apache/arrow/runs/19303699402) has more details. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] enable field-order-agnostic overloads of `fromarrow` for struct types [arrow-julia]
jrevels commented on PR #493: URL: https://github.com/apache/arrow-julia/pull/493#issuecomment-1839458969

Did some benchmarking, looks like there is a small perf hit here to the general case that needs to be mitigated before this can be merged:

```jl
using Arrow, ArrowTypes, Random, BenchmarkTools

struct Foo
    a::Int
    b::String
    c::Vector{String}
    d::Float64
end

ArrowTypes.arrowname(::Type{Foo}) = Symbol("JuliaLang.Foo")
ArrowTypes.JuliaType(::Val{Symbol("JuliaLang.Foo")}, T) = Foo

genfoo() = Foo(rand(1:10), randstring(10), [randstring(10) for _ in 1:rand(2:5)], rand())

t = (; f = [genfoo() for _ in 1:1000])
f = Arrow.Table(Arrow.tobuffer(t)).f

@benchmark sum(x -> x.d, $f)
```

results on `jrevels:jr/structelement`

```
BenchmarkTools.Trial: 6891 samples with 1 evaluation.
 Range (min … max):  631.291 μs … 2.410 ms   ┊ GC (min … max):  0.00% … 70.30%
 Time  (median):     646.250 μs              ┊ GC (median):     0.00%
 Time  (mean ± σ):   724.383 μs ± 299.599 μs ┊ GC (mean ± σ):  10.04% ± 15.20%

  631 μs          Histogram: log(frequency) by time        1.95 ms <

 Memory estimate: 2.53 MiB, allocs estimate: 22470.
```

results on `main`

```
BenchmarkTools.Trial: 8118 samples with 1 evaluation.
 Range (min … max):  531.667 μs … 2.338 ms   ┊ GC (min … max):  0.00% … 73.71%
 Time  (median):     550.333 μs              ┊ GC (median):     0.00%
 Time  (mean ± σ):   614.904 μs ± 266.473 μs ┊ GC (mean ± σ):   9.82% ± 14.78%

  532 μs          Histogram: log(frequency) by time        1.78 ms <

 Memory estimate: 2.23 MiB, allocs estimate: 18482.
```

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [C++] Implement ODBC driver "wrapper" using FlightSQL [arrow]
lidavidm commented on issue #30622: URL: https://github.com/apache/arrow/issues/30622#issuecomment-1839415183 @alinaliBQ how about `cpp/src/flightsql_odbc` or similar? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] GH-38007: [C++][Python] Add VariableShapeTensor implementation [arrow]
jorisvandenbossche commented on code in PR #38008: URL: https://github.com/apache/arrow/pull/38008#discussion_r1414433628

## python/pyarrow/array.pxi:
```diff
@@ -3586,6 +3586,156 @@ class FixedShapeTensorArray(ExtensionArray):
         )
 
 
+cdef class VariableShapeTensorArray(ExtensionArray):
+    """
+    Concrete class for variable shape tensor extension arrays.
+
+    Examples
+    --------
+    Define the extension type for tensor array
+
+    >>> import pyarrow as pa
+    >>> tensor_type = pa.variable_shape_tensor(pa.int32(), 2)
+
+    Create an extension array
+
+    >>> shapes = pa.array([[2, 3], [1, 2]], pa.list_(pa.uint32(), 2))
+    >>> values = pa.array([[1, 2, 3, 4, 5, 6], [7, 8]], pa.list_(pa.int32()))
+    >>> arr = pa.StructArray.from_arrays([shapes, values], names=["shape", "data"])
+    >>> pa.ExtensionArray.from_storage(tensor_type, arr)
+    -- is_valid: all not null
+    -- child 0 type: fixed_size_list[2]
+      [
+        [
+          2,
+          3
+        ],
+        [
+          1,
+          2
+        ]
+      ]
+    -- child 1 type: list
+      [
+        [
+          1,
+          2,
+          3,
+          4,
+          5,
+          6
+        ],
+        [
+          7,
+          8
+        ]
+      ]
+    """
+
+    def to_numpy_ndarray(self):
```

Review Comment: Is this actually needed as a method to start with? My expectation is that `to_numpy()` should ideally return this, but just as an object-dtype numpy array of numpy arrays, instead of a list of numpy arrays.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
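The return type being discussed (an object-dtype numpy array of per-element tensors rather than a Python list) can be sketched with plain numpy. This is a standalone illustration of the idea, not pyarrow's actual implementation, and the function name is hypothetical:

```python
import numpy as np

def ragged_to_object_ndarray(shapes, values):
    """Pack variable-shape tensors into a 1-D object-dtype ndarray,
    one reshaped tensor per element, instead of a plain Python list."""
    out = np.empty(len(shapes), dtype=object)
    for i, (shape, data) in enumerate(zip(shapes, values)):
        # Each row carries its own shape, so reshape per element.
        out[i] = np.asarray(data).reshape(shape)
    return out

# The same example data as the docstring above: shapes (2, 3) and (1, 2).
tensors = ragged_to_object_ndarray(
    [(2, 3), (1, 2)],
    [[1, 2, 3, 4, 5, 6], [7, 8]],
)
```

Unlike a list, the object-dtype result still supports numpy-style indexing and vectorized object operations while holding differently shaped tensors.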
[PR] Add Projection to FilterExec [arrow-datafusion]
Dandandan opened a new pull request, #8416: URL: https://github.com/apache/arrow-datafusion/pull/8416 ## Which issue does this PR close? Closes #. ## Rationale for this change ## What changes are included in this PR? ## Are these changes tested? ## Are there any user-facing changes? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] GH-15058: [C++][Python] Native support for UUID [arrow]
arogozhnikov commented on PR #37298: URL: https://github.com/apache/arrow/pull/37298#issuecomment-1839384913 Hi @rok , can you please summarize where you're stuck at with UUIDs? I see some installation failures -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] feat(extensions/nanoarrow_ipc): Add integration testing utility executable [arrow-nanoarrow]
codecov-commenter commented on PR #330: URL: https://github.com/apache/arrow-nanoarrow/pull/330#issuecomment-1839382020

## [Codecov](https://app.codecov.io/gh/apache/arrow-nanoarrow/pull/330?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) Report

All modified and coverable lines are covered by tests :white_check_mark:

> Comparison is base [(`4744498`)](https://app.codecov.io/gh/apache/arrow-nanoarrow/commit/4744498daf166bc3ae8d34d157a2598002447d4e?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) 87.83% compared to head [(`ecbeb37`)](https://app.codecov.io/gh/apache/arrow-nanoarrow/pull/330?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) 89.10%.
> Report is 3 commits behind head on main.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main     #330      +/-   ##
==========================================
+ Coverage   87.83%   89.10%   +1.27%
==========================================
  Files          72        4      -68
  Lines       11114      101   -11013
==========================================
- Hits         9762       90    -9672
+ Misses       1352       11    -1341
```

[:umbrella: View full report in Codecov by Sentry](https://app.codecov.io/gh/apache/arrow-nanoarrow/pull/330?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).

:loudspeaker: Have feedback on the report? [Share it here](https://about.codecov.io/codecov-pr-comment-feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org