amoeba commented on issue #44992: URL: https://github.com/apache/arrow/issues/44992#issuecomment-2532636356
Hi @sonicseamus, thanks for the question. There is a way to construct the Dataset manually, skipping the discovery process that `open_dataset` normally runs (and which can't run here), but I think you'll immediately hit an issue: Arrow doesn't support constructing datasets directly over HTTP, see https://github.com/apache/arrow/issues/23849. That said, you can get partway to what you want (with some workarounds) because the underlying Parquet files are on S3 or S3-like storage, and Arrow does support S3. The one limitation is that you won't be able to take advantage of partitioning. If possible, asking the MBTA group to open up access to the bucket root would be a nice improvement here.

```r
library(dplyr)
library(arrow)
library(stringr)

# First we need to get the list of files
file_url_prefix <- "https://performancedata.mbta.com/lamp/subway-on-time-performance-v1/"
index_csv_url <- paste0(file_url_prefix, "index.csv")
index_tbl <- read_csv_arrow(index_csv_url)
index_tbl$relpath <- str_remove_all(index_tbl$file_url, file_url_prefix)

# Then, to create a Dataset manually, we need to create a FileSystem first.
# Since we don't support HTTP filesystems but this dataset is backed by S3-like
# storage, we can create a custom S3 FileSystem and still use our S3 driver
fs <- S3FileSystem$create(
  anonymous = TRUE,
  endpoint_override = "https://performancedata.mbta.com"
)

# cd into the right path
ds_fs <- fs$cd("lamp/subway-on-time-performance-v1")

dsf <- FileSystemDatasetFactory$create(
  filesystem = ds_fs,
  paths = index_tbl$relpath,
  format = FileFormat$create("parquet")
)
ds <- dsf$Finish()

# Now we can use the Dataset as normal
ds |>
  head() |>
  collect()
```

This gives me:

```
# A tibble: 6 × 27
  stop_sequence stop_id   parent_station move_timestamp stop_timestamp travel_time_seconds dwell_time_seconds headway_trunk_seconds headway_branch_seconds service_date
          <int> <chr>     <chr>                   <int>          <int>               <int>              <int>                 <int>                  <int>        <int>
1             1 Oak Grov… place-ogmnl                NA     1568643555                  NA                 NA                    NA                     NA     20190916
2             1 70210     place-lech         1568643599             NA                  NA                 NA                    NA                     NA     20190916
3           120 70095     place-jfk                  NA     1568643680                  NA                 NA                    NA                     NA     20190916
4           650 70209     place-lech                 NA     1568643698                  NA                 NA                    NA                     NA     20190916
5             1 Alewife-… place-alfcl                NA     1568643730                  NA                 NA                    NA                     NA     20190916
6           310 70107     place-lake                 NA     1568643731                  NA                 NA                    NA                     NA     20190916
# ℹ 17 more variables: route_id <chr>, direction_id <lgl>, start_time <int>, vehicle_id <chr>, branch_route_id <chr>, trunk_route_id <chr>, stop_count <int>,
#   trip_id <chr>, vehicle_label <chr>, vehicle_consist <chr>, direction <chr>, direction_destination <chr>, scheduled_arrival_time <int>,
#   scheduled_departure_time <int>, scheduled_travel_time <int>, scheduled_headway_branch <int>, scheduled_headway_trunk <int>
```

Does the above seem like a workable solution for you?
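One more note on the partitioning limitation: even without partition pruning, arrow can still push `dplyr` filters and column selections down to the Parquet readers, which recovers some of the benefit. A hedged sketch (column names are taken from the schema printed above; `ds` is the Dataset built earlier, and exact pushdown behavior depends on your arrow version):

```r
library(dplyr)
library(arrow)

# Only the referenced columns are read from the Parquet files, and the
# service_date predicate is evaluated before rows are pulled into R.
ds |>
  filter(service_date == 20190916L, !is.na(travel_time_seconds)) |>
  group_by(route_id) |>
  summarise(mean_travel_secs = mean(travel_time_seconds)) |>
  collect()
```

If per-date reads are still slow, another option is to subset `index_tbl$relpath` before calling `FileSystemDatasetFactory$create()`, since the factory only opens the paths it is given.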
