amoeba commented on issue #44992: URL: https://github.com/apache/arrow/issues/44992#issuecomment-2532636356
Hi @sonicseamus, thanks for the question. There is a way to construct the Dataset manually, skipping the discovery process that `open_dataset` normally runs (and which can't run here), but I think you'll immediately hit an issue: Arrow doesn't support constructing datasets directly over HTTP, see https://github.com/apache/arrow/issues/23849. That said, you can get partway to what you want (with some workarounds) because the underlying Parquet files are on S3 or S3-like storage, and Arrow does support S3. The one limitation is that you won't be able to take advantage of partitioning. If possible, asking the MBTA group to open up access to the bucket root would be a nice improvement here.

```r
library(dplyr)
library(arrow)
library(stringr)

# First we need to get the list of files
file_url_prefix <- "https://performancedata.mbta.com/lamp/subway-on-time-performance-v1/"
index_csv_url <- paste0(file_url_prefix, "index.csv")
index_tbl <- read_csv_arrow(index_csv_url)
index_tbl$relpath <- str_remove_all(index_tbl$file_url, file_url_prefix)

# Then, to create a Dataset manually, we need to create a FileSystem first.
# Since we don't support HTTP filesystems but this dataset is backed by S3-like
# storage, we can create a custom S3 FileSystem and still use our S3 driver
fs <- S3FileSystem$create(
  anonymous = TRUE,
  endpoint_override = "https://performancedata.mbta.com"
)

# cd into the right path
ds_fs <- fs$cd("lamp/subway-on-time-performance-v1")

dsf <- FileSystemDatasetFactory$create(
  filesystem = ds_fs,
  paths = index_tbl$relpath,
  format = FileFormat$create("parquet")
)
ds <- dsf$Finish()

# Now we can use the Dataset as normal
ds |>
  head() |>
  collect()
```

This gives me:

```
# A tibble: 6 × 27
  stop_sequence stop_id   parent_station move_timestamp stop_timestamp travel_time_seconds dwell_time_seconds headway_trunk_seconds headway_branch_seconds service_date
          <int> <chr>     <chr>                   <int>          <int>               <int>              <int>                 <int>                  <int>        <int>
1             1 Oak Grov… place-ogmnl                NA     1568643555                  NA                 NA                    NA                     NA     20190916
2             1 70210     place-lech         1568643599             NA                  NA                 NA                    NA                     NA     20190916
3           120 70095     place-jfk                  NA     1568643680                  NA                 NA                    NA                     NA     20190916
4           650 70209     place-lech                 NA     1568643698                  NA                 NA                    NA                     NA     20190916
5             1 Alewife-… place-alfcl                NA     1568643730                  NA                 NA                    NA                     NA     20190916
6           310 70107     place-lake                 NA     1568643731                  NA                 NA                    NA                     NA     20190916
# ℹ 17 more variables: route_id <chr>, direction_id <lgl>, start_time <int>, vehicle_id <chr>, branch_route_id <chr>, trunk_route_id <chr>, stop_count <int>,
#   trip_id <chr>, vehicle_label <chr>, vehicle_consist <chr>, direction <chr>, direction_destination <chr>, scheduled_arrival_time <int>,
#   scheduled_departure_time <int>, scheduled_travel_time <int>, scheduled_headway_branch <int>, scheduled_headway_trunk <int>
```

Does the above seem like a workable solution for you?
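One more note on the partitioning limitation: even without partition pruning, arrow can still push `dplyr` filters and column selections down to the Parquet readers, which recovers some of the benefit. A hedged sketch (column names are taken from the schema printed above; `ds` is the Dataset built earlier, and exact pushdown behavior depends on your arrow version):

```r
library(dplyr)
library(arrow)

# Only the referenced columns are read from the Parquet files, and the
# service_date predicate is evaluated before rows are pulled into R.
ds |>
  filter(service_date == 20190916L, !is.na(travel_time_seconds)) |>
  group_by(route_id) |>
  summarise(mean_travel_secs = mean(travel_time_seconds)) |>
  collect()
```

If per-date reads are still slow, another option is to subset `index_tbl$relpath` before calling `FileSystemDatasetFactory$create()`, since the factory only opens the paths it is given.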
