Leona Yoda created SPARK-36024:
----------------------------------
Summary: Switch the datasource example due to the deprecation of
the dataset
Key: SPARK-36024
URL: https://issues.apache.org/jira/browse/SPARK-36024
Project: Spark
Issue Type: Documentation
Components: Documentation
Affects Versions: 3.1.2
Reporter: Leona Yoda
The S3 bucket used as an example in the "Integration with Cloud
Infrastructures" document will be deleted on Jul 1, 2021:
[https://registry.opendata.aws/landsat-8/]
The dataset will move to another bucket, but that bucket requires the
`--request-payer requester` option, so users would have to pay the S3 cost:
[https://registry.opendata.aws/usgs-landsat/]
So I think it is better to switch the example to a different data source, as in this commit:
[https://github.com/yoda-mon/spark/commit/cdb24acdbb57a429e5bf1729502653b91a600022]
I chose the NYC Taxi data
([https://registry.opendata.aws/nyc-tlc-trip-records-pds/])
as the example here.
Unlike the Landsat data it is not compressed, but it is only an example, and there are
several tutorials that use it with Spark (e.g.
[https://github.com/aws-samples/amazon-eks-apache-spark-etl-sample]).
Read test result:
{code:java}
scala> sc.textFile("s3a://nyc-tlc/misc/taxi _zone_lookup.csv").take(10).foreach(println)
"LocationID","Borough","Zone","service_zone"
1,"EWR","Newark Airport","EWR"
2,"Queens","Jamaica Bay","Boro Zone"
3,"Bronx","Allerton/Pelham Gardens","Boro Zone"
4,"Manhattan","Alphabet City","Yellow Zone"
5,"Staten Island","Arden Heights","Boro Zone"
6,"Staten Island","Arrochar/Fort Wadsworth","Boro Zone"
7,"Queens","Astoria","Boro Zone"
8,"Queens","Astoria Park","Boro Zone"
9,"Queens","Auburndale","Boro Zone"
{code}
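For reference, the same file could also be read through the DataFrame API. The following is only a minimal, untested sketch for spark-shell. It assumes that the hadoop-aws module is on the classpath, that the bucket allows anonymous reads, and that the object key is exactly the one used in the read test above (including the space after "taxi"). The anonymous credentials setting and the `zones` name are my own additions for illustration and are not part of the proposed documentation change.

```scala
// Sketch for spark-shell, where `spark` is the predefined SparkSession.
// Assumption: hadoop-aws and its AWS SDK dependency are already on the classpath.

// Assumption: the bucket is public, so read it without AWS credentials.
// This should be set before the first access to the bucket in this session.
spark.sparkContext.hadoopConfiguration.set(
  "fs.s3a.aws.credentials.provider",
  "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")

// Same object as in the read test; the first line is treated as the header.
val zones = spark.read
  .option("header", "true")
  .csv("s3a://nyc-tlc/misc/taxi _zone_lookup.csv")

// Inspect the inferred column names and the first rows.
zones.printSchema()
zones.show(10, truncate = false)
```

With the header option the columns should be named after the first line of the file ("LocationID", "Borough", "Zone", "service_zone"), and all of them are read as strings unless schema inference is enabled.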