alihan-synnada opened a new issue, #13896:
URL: https://github.com/apache/datafusion/issues/13896

   ### Describe the bug
   
   Attempting to download the IMDB dataset gives the following error:
   
   ```
   tar: Error opening archive: Unrecognized archive format
   ```
   
   An `IMDB.tgz` file is still created, but it contains the server's 404 page rather than an archive:
   
   ```html
   <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
   <html><head>
   <title>404 Not Found</title>
   </head><body>
   <h1>Not Found</h1>
   <p>The requested URL was not found on this server.</p>
   </body></html>
   ```
   
   It seems the dataset has been removed from the server or is otherwise unavailable.
   
   ### To Reproduce
   
   Run `benchmarks/bench.sh data imdb`
   
   ### Expected behavior
   
   It should download the dataset, extract the CSV files, and convert them to Parquet.
   
   ### Additional context
   
   The related part of `bench.sh`:
   
https://github.com/apache/datafusion/blob/6cfd1cf1e030ccfe3b17621cc51fdcefcceae018/benchmarks/bench.sh#L458-L463
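
Independently of where the dataset ends up being hosted, the download step could fail early instead of handing an HTML error page to `tar`. The following is only a rough sketch of that idea, not the current contents of `bench.sh`: the function names `is_valid_tgz` and `fetch_tgz` are hypothetical, and it assumes `curl` is available (the script may well use a different download tool).

```shell
#!/usr/bin/env bash

# Hypothetical helper (not in bench.sh): succeeds only if the file is a
# non-empty, readable gzip-compressed tar archive.
is_valid_tgz() {
    local archive="$1"
    # `tar -tzf` lists the members without extracting anything; it exits
    # non-zero for an HTML error page or any other non-archive content.
    [ -s "$archive" ] && tar -tzf "$archive" > /dev/null 2>&1
}

# Hypothetical download wrapper (assumes curl): the caller supplies the URL
# and the output path.
fetch_tgz() {
    local url="$1" out="$2"
    # --fail makes curl exit non-zero on HTTP errors such as 404 instead of
    # saving the server's error page to the output file.
    if ! curl --fail --location --silent --show-error --output "$out" "$url"; then
        echo "Download failed: $url" >&2
        rm -f "$out"
        return 1
    fi
    if ! is_valid_tgz "$out"; then
        echo "Downloaded file is not a valid .tgz archive: $out" >&2
        rm -f "$out"
        return 1
    fi
}
```

With a guard like this, a missing dataset would surface as a clear "Download failed" message rather than as `tar: Error opening archive: Unrecognized archive format`.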


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

