Github user sraghunandan commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2576#discussion_r207074600
--- Diff: docs/s3-guide.md ---
@@ -0,0 +1,64 @@
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements. See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to you under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+
+# S3 Guide (Alpha Feature 1.4.1)
+S3 is a cloud Object Storage API and is recommended for storing large data files. You can use
+this feature if you want to store data on Amazon cloud or Huawei cloud (OBS).
+Since the data is stored on the cloud, there are no restrictions on the size of the data, and
+the data can be accessed from anywhere at any time.
+CarbonData supports any Object Storage that conforms to the Amazon S3 API.
+
+# Writing to Object Storage
+To store carbondata files in an Object Store location, you need to set the
+`carbon.storelocation` property to the Object Store path in the CarbonProperties file.
+For example, carbon.storelocation=s3a://mybucket/carbonstore.
+By setting this property, all tables will be created on the specified Object Store path.
+
+If your existing store is HDFS and you want to store specific tables on an S3 location,
+then the `location` parameter has to be set during create table.
+For example:
+
+```
+CREATE TABLE IF NOT EXISTS db1.table1(col1 string, col2 int) STORED AS carbondata LOCATION 's3a://mybucket/carbonstore'
+```
+
+For more details on create table, refer to
+[data-management-on-carbondata](https://github.com/apache/carbondata/blob/master/docs/data-management-on-carbondata.md#create-table).
+
+# Authentication
+You need to set authentication properties to store carbondata files in an S3 location.
+For more details on authentication properties, refer to the
+[Hadoop authentication documentation](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Authentication_properties).
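+
+Alternatively, the same properties can be passed on the command line. A sketch using
+`spark-submit` (the key, secret, and endpoint values below are placeholders, and the
+`spark.hadoop.` prefix forwards each property to the Hadoop configuration):
+
+```
+spark-submit \
+  --conf spark.hadoop.fs.s3a.access.key=<access-key> \
+  --conf spark.hadoop.fs.s3a.secret.key=<secret-key> \
+  --conf spark.hadoop.fs.s3a.endpoint=<endpoint> \
+  ...
+```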
+
+Another way of setting the authentication parameters is as follows:
+
+```
+// getOrCreateCarbonSession is provided via the CarbonSession implicits
+import org.apache.spark.sql.CarbonSession._
+
+SparkSession
+  .builder()
+  .master(masterURL)
+  .appName("S3Example")
+  .config("spark.driver.host", "localhost")
+  .config("spark.hadoop.fs.s3a.access.key", "1111")
+  .config("spark.hadoop.fs.s3a.secret.key", "2222")
+  .config("spark.hadoop.fs.s3a.endpoint", "1.1.1.1")
+  .getOrCreateCarbonSession()
+```
+
+# Recommendations
+1. Object Storage like S3 does not support the file leasing mechanism (supported by HDFS) that is
+required to take locks which ensure consistency between concurrent operations. Therefore, it is
+recommended to set the configurable lock path property ([carbon.lock.path](https://github.com/apache/carbondata/blob/master/docs/configuration-parameters.md#miscellaneous-configuration))
+to an HDFS directory.
+2. As Object Stores are eventually consistent, meaning that any put request can take some time to
--- End diff --
Concurrent data manipulation operations are not supported. Object stores
follow eventual consistency semantics, i.e., any put request might take some time
to be reflected when listing. This behaviour means that data reads are not
guaranteed to be consistent or up to date.
---