[ 
https://issues.apache.org/jira/browse/HBASE-25824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Kyle Purtell updated HBASE-25824:
----------------------------------------
    Fix Version/s: 3.0.0-alpha-1
           Status: Patch Available  (was: Open)

> IntegrationTestLoadCommonCrawl
> ------------------------------
>
>                 Key: HBASE-25824
>                 URL: https://issues.apache.org/jira/browse/HBASE-25824
>             Project: HBase
>          Issue Type: Test
>          Components: integration tests
>            Reporter: Andrew Kyle Purtell
>            Assignee: Andrew Kyle Purtell
>            Priority: Minor
>             Fix For: 3.0.0-alpha-1
>
>
> This integration test loads successful resource retrieval records from the 
> Common Crawl (https://commoncrawl.org/) public dataset into an HBase table 
> and writes records that can be used to later verify the presence and 
> integrity of those records.
> Run like:
> {noformat}
> ./bin/hbase org.apache.hadoop.hbase.test.IntegrationTestLoadCommonCrawl \
>   -Dfs.s3n.awsAccessKeyId=<AWS access key> \
>   -Dfs.s3n.awsSecretAccessKey=<AWS secret key> \
>   /path/to/test-CC-MAIN-2021-10-warc.paths.gz \
>   /path/to/tmp/warc-loader-output
> {noformat}
> Access to the Common Crawl dataset in S3 is made available to anyone by 
> Amazon AWS, but Hadoop's S3N filesystem still requires valid access 
> credentials to initialize.
> The input path can specify either a directory or a file. A file may 
> optionally be compressed with gzip. If the path is a directory, the loader 
> expects it to contain one or more WARC files from the Common Crawl dataset. 
> If the path is a file, the loader expects a list of Hadoop S3N URIs that 
> point to S3 locations of one or more WARC files from the Common Crawl 
> dataset, one URI per line. Lines should be terminated with the UNIX line 
> terminator (LF).
> Included in hbase-it/src/test/resources/CC-MAIN-2021-10-warc.paths.gz is a 
> list of all WARC files comprising the Q1 2021 crawl archive. There are 64,000 
> WARC files in this data set, each containing ~1GB of gzipped data. The WARC 
> files contain several record types, such as metadata, request, and response, 
> but we only load the response record types. If the HBase table schema does 
> not specify compression (the default), there is roughly a 10x expansion. 
> Loading the full crawl archive therefore results in a table approximately 
> 640 TB in size.
> The hadoop-aws jar will be needed at runtime to instantiate the S3N 
> filesystem. Use the -files ToolRunner argument to add it.
> You can also split the Loader and Verify stages:
> Load with:
> {noformat}
> ./bin/hbase 'org.apache.hadoop.hbase.test.IntegrationTestLoadCommonCrawl$Loader' \
>   -files /path/to/hadoop-aws.jar \
>   -Dfs.s3n.awsAccessKeyId=<AWS access key> \
>   -Dfs.s3n.awsSecretAccessKey=<AWS secret key> \
>   /path/to/test-CC-MAIN-2021-10-warc.paths.gz \
>   /path/to/tmp/warc-loader-output
> {noformat}
> Verify with:
> {noformat}
> ./bin/hbase 'org.apache.hadoop.hbase.test.IntegrationTestLoadCommonCrawl$Verify' \
>   /path/to/tmp/warc-loader-output
> {noformat}
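
The file form of the input path described in the issue (a gzip-compressed list of S3N URIs, one per line, with UNIX line terminators) can be produced and checked with a short script. The following is only a sketch and is not part of the patch: the helper names write_paths_file and read_paths_file, the s3n://commoncrawl/ prefix, and the sample object keys are all assumptions made for illustration.

```python
import gzip
import os
import tempfile

# Assumption for illustration: WARC objects live under this S3N prefix.
S3N_PREFIX = "s3n://commoncrawl/"


def write_paths_file(keys, out_path, prefix=S3N_PREFIX):
    """Write one S3N URI per line, LF-terminated, into a gzip file."""
    uris = [prefix + k.strip().lstrip("/") for k in keys if k.strip()]
    # Binary mode avoids platform newline translation (no CRLF on Windows).
    with gzip.open(out_path, "wb") as out:
        for uri in uris:
            out.write((uri + "\n").encode("utf-8"))
    return uris


def read_paths_file(path):
    """Read the list back, rejecting carriage returns."""
    with gzip.open(path, "rb") as f:
        text = f.read().decode("utf-8")
    if "\r" in text:
        raise ValueError("non-UNIX line terminator found")
    return text.splitlines()


if __name__ == "__main__":
    # Hypothetical object keys, not real Common Crawl paths.
    sample = [
        "crawl-data/CC-MAIN-2021-10/segments/example/warc/example-00000.warc.gz",
        "crawl-data/CC-MAIN-2021-10/segments/example/warc/example-00001.warc.gz",
    ]
    out_file = os.path.join(tempfile.mkdtemp(), "sample-warc.paths.gz")
    write_paths_file(sample, out_file)
    for line in read_paths_file(out_file):
        print(line)
```

A small list built this way may be a convenient way to exercise the Loader and Verify stages before pointing them at the full paths file.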

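The table size quoted in the issue follows from the three figures given there: 64,000 WARC files, about 1 GB each, and roughly 10x expansion without compression. A minimal sketch of that arithmetic, assuming decimal units (1 TB = 1000 GB) and using a hypothetical helper name, may be useful for estimating the size of a partial load:

```python
def estimate_table_size_tb(num_files, gb_per_file=1.0, expansion=10.0):
    """Rough uncompressed table size in TB (assumes 1 TB = 1000 GB)."""
    return num_files * gb_per_file * expansion / 1000.0


if __name__ == "__main__":
    # Figures taken from the issue description.
    print(estimate_table_size_tb(64_000))
    # A hypothetical 100-file subset for a smaller test run.
    print(estimate_table_size_tb(100))
```

The per-file size and expansion factor are approximations from the description, so the result should be read as an order-of-magnitude estimate only.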


--
This message was sent by Atlassian Jira
(v8.3.4#803005)
