[ https://issues.apache.org/jira/browse/HBASE-25824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrew Kyle Purtell updated HBASE-25824:
----------------------------------------
    Fix Version/s: 3.0.0-alpha-1
           Status: Patch Available  (was: Open)

> IntegrationTestLoadCommonCrawl
> ------------------------------
>
>                 Key: HBASE-25824
>                 URL: https://issues.apache.org/jira/browse/HBASE-25824
>             Project: HBase
>          Issue Type: Test
>          Components: integration tests
>            Reporter: Andrew Kyle Purtell
>            Assignee: Andrew Kyle Purtell
>            Priority: Minor
>             Fix For: 3.0.0-alpha-1
>
>
> This integration test loads successful resource retrieval records from the
> Common Crawl (https://commoncrawl.org/) public dataset into an HBase table
> and writes records that can be used to later verify the presence and
> integrity of those records.
>
> Run like:
> {noformat}
> ./bin/hbase org.apache.hadoop.hbase.test.IntegrationTestLoadCommonCrawl \
>   -Dfs.s3n.awsAccessKeyId=<AWS access key> \
>   -Dfs.s3n.awsSecretAccessKey=<AWS secret key> \
>   /path/to/test-CC-MAIN-2021-10-warc.paths.gz \
>   /path/to/tmp/warc-loader-output
> {noformat}
>
> Access to the Common Crawl dataset in S3 is made available to anyone by
> Amazon AWS, but Hadoop's S3N filesystem still requires valid access
> credentials to initialize.
>
> The input path can specify either a directory or a file. The file may
> optionally be compressed with gzip. If a directory, the loader expects the
> directory to contain one or more WARC files from the Common Crawl dataset.
> If a file, the loader expects a list of Hadoop S3N URIs pointing to the S3
> locations of one or more WARC files from the Common Crawl dataset, one URI
> per line. Lines should be terminated with the UNIX line terminator.
>
> Included in hbase-it/src/test/resources/CC-MAIN-2021-10-warc.paths.gz is a
> list of all WARC files comprising the Q1 2021 crawl archive. There are
> 64,000 WARC files in this data set, each containing ~1 GB of gzipped data.
> The WARC files contain several record types, such as metadata, request, and
> response, but we load only the response record types.
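For reference, a one-URI-per-line input file of the kind described above can be derived from a Common Crawl paths listing. This is a minimal sketch only: it assumes the standard layout where the manifest lists keys relative to the public commoncrawl S3 bucket, and the file names and sample key are illustrative, not taken from the real dataset.

```shell
# Sketch: build a loader input file from a Common Crawl-style manifest.
# The manifest lists bucket-relative keys, one per line; prefixing each
# key with s3n://commoncrawl/ yields the S3N URIs the loader expects.
# (File names and the sample key below are illustrative stand-ins.)
printf 'crawl-data/CC-MAIN-2021-10/segments/000/warc/part-00000.warc.gz\n' \
  | gzip > sample-warc.paths.gz          # stand-in for the real manifest
gunzip -c sample-warc.paths.gz \
  | sed 's|^|s3n://commoncrawl/|' > test-warc.paths
cat test-warc.paths
```

The resulting test-warc.paths file (one S3N URI per line, UNIX line endings) can then be passed as the input path argument shown in the commands above.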
> If the HBase table schema does not specify compression (the default), there
> is roughly a 10x expansion. Loading the full crawl archive results in a
> table approximately 640 TB in size.
>
> The hadoop-aws jar will be needed at runtime to instantiate the S3N
> filesystem. Use the -files ToolRunner argument to add it.
>
> You can also split the Loader and Verify stages:
>
> Load with:
> {noformat}
> ./bin/hbase 'org.apache.hadoop.hbase.test.IntegrationTestLoadCommonCrawl$Loader' \
>   -files /path/to/hadoop-aws.jar \
>   -Dfs.s3n.awsAccessKeyId=<AWS access key> \
>   -Dfs.s3n.awsSecretAccessKey=<AWS secret key> \
>   /path/to/test-CC-MAIN-2021-10-warc.paths.gz \
>   /path/to/tmp/warc-loader-output
> {noformat}
>
> Verify with:
> {noformat}
> ./bin/hbase 'org.apache.hadoop.hbase.test.IntegrationTestLoadCommonCrawl$Verify' \
>   /path/to/tmp/warc-loader-output
> {noformat}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)