Re: [PR] Add JOB benchmark dataset [1/N] (imdb dataset) [datafusion]

via GitHub Thu, 19 Sep 2024 20:36:01 -0700


doupache commented on PR #12497:
URL: https://github.com/apache/datafusion/pull/12497#issuecomment-2362691758


   Thanks @austin362667 and @alamb.
   
   I have updated the PR and learned some Cargo tips from @austin362667.
   A debug build compiles much faster during development.
   
   
   ```sh
   # 1. Build the benchmarks crate (debug profile)
   cd benchmarks && cargo build

   # 2. Convert the IMDB dataset to Parquet
   cargo run --bin imdb -- convert --input ./data/imdb/ --output ./data/imdb/ --format parquet
   ```
   
   
   I also tested all 21 Parquet files as follows.
   
   ```sql 
   -- create table
   CREATE EXTERNAL TABLE name (
       id INTEGER NOT NULL PRIMARY KEY,
       name STRING NOT NULL,
       imdb_index STRING,
       imdb_id INTEGER,
       gender STRING,
       name_pcode_cf STRING,
       name_pcode_nf STRING,
       surname_pcode STRING,
       md5sum STRING
   )
   STORED AS PARQUET
   LOCATION '../benchmarks/data/imdb/temp/name.parquet';
   
   -- read
   SELECT * FROM name LIMIT 5;
   ```
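
   To repeat the read check across every converted file, a minimal sketch could generate one query per Parquet file. The helper name `gen_checks` is hypothetical, the directory is assumed from the `LOCATION` path above, and it assumes `datafusion-cli` can query a quoted file path directly.

   ```sh
   # Hypothetical helper: print one smoke-test query per Parquet file in a directory.
   # The directory layout is an assumption based on the LOCATION path above.
   gen_checks() {
       dir="$1"
       for f in "$dir"/*.parquet; do
           # Skip the unexpanded glob when the directory has no Parquet files.
           [ -e "$f" ] || continue
           printf "SELECT * FROM '%s' LIMIT 5;\n" "$f"
       done
   }
   ```

   For example, `gen_checks ../benchmarks/data/imdb/temp > checks.sql` writes the queries to a file, which should then be runnable with `datafusion-cli -f checks.sql`. This only checks that each file can be read; it does not validate the schema.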
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

