[GitHub] [arrow] litao3rd commented on issue #37840: [C++] get different total rows answer for one dataset

via GitHub Sat, 23 Sep 2023 10:05:10 -0700


litao3rd commented on issue #37840:
URL: https://github.com/apache/arrow/issues/37840#issuecomment-1732365861


   Sorry for my mistake.
   
   I am currently using version 12.0.1 of Arrow.
   
   To simplify reproduction, I have written a small script that downloads the 
files needed to reproduce this issue. The script creates a directory called 
"tlc-trip-record-data" in the current directory and downloads the data into 
it. Please note that we are behind the Great Firewall, so you may need to run 
the script through a proxy. Any extra arguments are forwarded to curl, for 
example: bash ./download-tcl-trip-record-data.sh --proxy protocol://host:port.
   
   I have encountered a perplexing issue. With all 6 months of data, the 
first block reports a total of 142,516,648 rows, while the other block reports 
0 rows. With only 5 months of data (excluding January), both blocks report the 
same total. I have not been able to locate the bug in Arrow myself because of 
the complexity of the code base.
   
   Please note that you need to change the path to the tlc-trip-record-data 
directory in the C++ code.
   
   ``` download-tcl-trip-record-data.sh
   #!/bin/bash
   
   set -e
   
   dataset="tlc-trip-record-data"
   
   # Create the target directory if it does not already exist.
   mkdir -p "$dataset"
   
   pushd "$dataset"
   
   for n in $(seq 1 6); do
       d="2023-0$n"
       echo "[*] downloading data for $d"
       # "$@" forwards any extra arguments (e.g. --proxy ...) to curl;
       # -f makes curl fail on HTTP errors instead of saving an error page.
       for kind in yellow green fhv fhvhv; do
           curl "$@" -fLO "https://d37ci6vzurychx.cloudfront.net/trip-data/${kind}_tripdata_$d.parquet"
       done
       echo "[*] finished downloading data for $d"
   done
   
   popd
   ```
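
Because the row counts depend on which months are included, it may be worth confirming that all 24 files (4 kinds, 6 months) were downloaded and are non-empty before comparing results. The following is only a minimal sketch under that assumption: `check_dataset` is a hypothetical helper that is not part of the script above, and it checks only for presence and non-zero size, not Parquet validity.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original script): print the name of
# every expected file that is missing or empty under the given directory.
check_dataset() {
    local dir="$1" kind n f
    for n in $(seq 1 6); do
        for kind in yellow green fhv fhvhv; do
            f="${kind}_tripdata_2023-0${n}.parquet"
            # -s is true only if the file exists and its size is greater than zero.
            [ -s "$dir/$f" ] || echo "$f"
        done
    done
}
```

Running `check_dataset tlc-trip-record-data` from the directory where the download script was executed prints nothing when every expected file is present and non-empty.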


