GitHub user aniketadnaik opened a pull request:

    https://github.com/apache/carbondata/pull/1352

    [CARBONDATA-1174] Streaming Ingestion - schema validation and streaming examples

    - Description:
    This change mainly targets the "streaming_ingest" development branch. The following changes are added on top of the previous framework changes (pr-1064):
    1. Schema validation of input data when it comes from a file source and a schema is specified. The source schema is validated against the existing table schema. For a socket source, no schema validation is required because no schema is attached to it.
    2. Streaming examples for file stream and socket stream sources: CarbonStreamingIngestFileSourceExample.scala and CarbonStreamingIngestSocketSourceExample.scala. These examples are added to help developers understand and analyze the code flow. They will run end to end once carbondata is able to write into the carbondata file format.
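
The schema check in item 1 can be pictured with a minimal, self-contained sketch. This is not the code in this pull request: the class `SchemaValidationSketch`, its methods, the use of plain name-to-type maps, and the case-insensitive comparison are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: names and comparison rules are illustrative and are
// not taken from CarbonData.
final class SchemaValidationSketch {

  // Lower-cases column names and type names so the comparison ignores case
  // (an assumption made for this sketch).
  static Map<String, String> normalize(Map<String, String> schema) {
    Map<String, String> out = new LinkedHashMap<>();
    for (Map.Entry<String, String> e : schema.entrySet()) {
      out.put(e.getKey().toLowerCase(Locale.ROOT),
              e.getValue().toLowerCase(Locale.ROOT));
    }
    return out;
  }

  // Returns one message per mismatch; an empty list means the schemas agree.
  static List<String> findMismatches(Map<String, String> source,
                                     Map<String, String> table) {
    Map<String, String> src = normalize(source);
    Map<String, String> tbl = normalize(table);
    List<String> problems = new ArrayList<>();
    for (Map.Entry<String, String> col : src.entrySet()) {
      String tableType = tbl.get(col.getKey());
      if (tableType == null) {
        problems.add("column '" + col.getKey() + "' is not in the table schema");
      } else if (!tableType.equals(col.getValue())) {
        problems.add("column '" + col.getKey() + "' has type " + col.getValue()
            + " but the table expects " + tableType);
      }
    }
    for (String name : tbl.keySet()) {
      if (!src.containsKey(name)) {
        problems.add("table column '" + name + "' is missing from the source schema");
      }
    }
    return problems;
  }

  // Throws when the source schema does not match the table schema.
  static void validate(Map<String, String> source, Map<String, String> table) {
    List<String> problems = findMismatches(source, table);
    if (!problems.isEmpty()) {
      throw new IllegalArgumentException(
          "Source schema does not match table schema: " + problems);
    }
  }
}
```

A caller would run `validate` before any writer is created and skip it entirely for sources that carry no schema, such as the socket source.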
    
    - Whether new unit test cases have been added or why no new tests are required?
      Yes, a new unit test for schema validation has been added.
    
    - What manual testing you have done?
    $> mvn clean -Pspark-2.1 -Dspark.version=2.1.0  verify
    [INFO] ------------------------------------------------------------------------
    [INFO] Reactor Summary:
    [INFO] 
    [INFO] Apache CarbonData :: Parent ........................ SUCCESS [  1.320 s]
    [INFO] Apache CarbonData :: Common ........................ SUCCESS [  1.509 s]
    [INFO] Apache CarbonData :: Core .......................... SUCCESS [ 26.109 s]
    [INFO] Apache CarbonData :: Processing .................... SUCCESS [  4.892 s]
    [INFO] Apache CarbonData :: Hadoop ........................ SUCCESS [  8.910 s]
    [INFO] Apache CarbonData :: Spark Common .................. SUCCESS [ 13.876 s]
    [INFO] Apache CarbonData :: Spark2 ........................ SUCCESS [02:29 min]
    [INFO] Apache CarbonData :: Spark Common Test ............. SUCCESS [07:06 min]
    [INFO] Apache CarbonData :: Assembly ...................... SUCCESS [  1.724 s]
    [INFO] Apache CarbonData :: Flink Examples ................ SUCCESS [  2.480 s]
    [INFO] Apache CarbonData :: Hive .......................... SUCCESS [  4.776 s]
    [INFO] Apache CarbonData :: presto ........................ SUCCESS [  5.786 s]
    [INFO] Apache CarbonData :: Spark2 Examples ............... SUCCESS [  4.957 s]
    [INFO] ------------------------------------------------------------------------
    [INFO] BUILD SUCCESS
    [INFO] ------------------------------------------------------------------------
    [INFO] Total time: 10:52 min
    [INFO] Finished at: 2017-09-12T10:50:40-07:00
    [INFO] Final Memory: 119M/1223M
    [INFO] ------------------------------------------------------------------------
    
      $> mvn clean verify
      [INFO] ------------------------------------------------------------------------
      [INFO] Reactor Summary:
      [INFO] 
      [INFO] Apache CarbonData :: Parent ........................ SUCCESS [  6.925 s]
      [INFO] Apache CarbonData :: Common ........................ SUCCESS [ 10.383 s]
      [INFO] Apache CarbonData :: Core .......................... SUCCESS [02:07 min]
      [INFO] Apache CarbonData :: Processing .................... SUCCESS [ 21.376 s]
      [INFO] Apache CarbonData :: Hadoop ........................ SUCCESS [ 18.568 s]
      [INFO] Apache CarbonData :: Spark Common .................. SUCCESS [01:03 min]
      [INFO] Apache CarbonData :: Spark ......................... SUCCESS [04:34 min]
      [INFO] Apache CarbonData :: Spark Common Test ............. SUCCESS [24:33 min]
      [INFO] Apache CarbonData :: Assembly ...................... SUCCESS [  8.661 s]
      [INFO] Apache CarbonData :: Spark Examples ................ SUCCESS [ 22.520 s]
      [INFO] Apache CarbonData :: Flink Examples ................ SUCCESS [  6.592 s]
      [INFO] ------------------------------------------------------------------------
      [INFO] BUILD SUCCESS
      [INFO] ------------------------------------------------------------------------
      [INFO] Total time: 33:55 min
      [INFO] Finished at: 2017-09-12T08:12:30-07:00
      [INFO] Final Memory: 62M/298M
      [INFO] ------------------------------------------------------------------------
        * Made sure write path class invocation and schema validation happen correctly with Spark Structured Streaming (2.1) and a parquet file source
        * Made sure the write path execution workflow runs with Structured Streaming (2.1) for both socket and file sources
    
    - Any additional information to help reviewers in testing this change.
    For an invalid schema, carbondata throws an exception and no record writer is instantiated. This is a first level of validation of input streaming data at the CarbonSource entry point; another level of input data validation happens in the carbon load path anyway.
    Some file sources allow the schema to be inferred if "spark.sql.streaming.schemaInference" is set to true and no explicit schema is specified. In that case we validate against the inferred schema. Carbondata also provides inferSchema functionality if a table path is provided. The inferSchema() functionality is used in the read path (readStream) and will be applicable when read path functionality is implemented.
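
The choice between an explicit and an inferred schema described above can be sketched as a small decision function. This is only an illustration under assumptions: `StreamSchemaChoiceSketch`, `schemaToValidate`, and its parameters are hypothetical names, the schema is represented by an arbitrary type `S`, and nothing here calls Spark or CarbonData.

```java
import java.util.Optional;

// Hypothetical sketch: the class, method, and parameter names are
// illustrative and are not CarbonData or Spark APIs.
final class StreamSchemaChoiceSketch {

  // Picks the schema that should be checked against the table schema.
  // An empty result means there is nothing to validate (for example, a
  // socket source with no schema attached).
  static <S> Optional<S> schemaToValidate(Optional<S> explicitSchema,
                                          boolean schemaInferenceEnabled,
                                          Optional<S> inferredSchema) {
    if (explicitSchema.isPresent()) {
      // A user-specified schema always takes precedence.
      return explicitSchema;
    }
    if (schemaInferenceEnabled) {
      // No explicit schema: fall back to whatever the source inferred.
      return inferredSchema;
    }
    return Optional.empty();
  }

  public static void main(String[] args) {
    Optional<String> chosen = schemaToValidate(
        Optional.<String>empty(), true, Optional.of("id INT, name STRING"));
    System.out.println(chosen);
  }
}
```

Here `schemaInferenceEnabled` stands in for the value of "spark.sql.streaming.schemaInference"; how that flag is actually read is outside this sketch.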

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/aniketadnaik/carbondataStreamIngest streamIngest-1174

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/carbondata/pull/1352.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1352
    
----
commit ac4f9c2a3bd0c7e7569bde9ce797abcb424222a4
Author: Aniket Adnaik <[email protected]>
Date:   2017-09-08T00:28:00Z

    [CARBONDATA-1174] Streaming Ingestion - Schema validation and Examples

commit 8e710b8b5265cc1b3db52deecfae2086cb46993b
Author: Aniket Adnaik <[email protected]>
Date:   2017-09-09T00:32:02Z

    [CARBONDATA-1174] Streaming Ingestion - schema validation and streaming examples

commit 991d12aa0ec8ec58a5763f28ef6260c668b1f1c4
Author: Aniket Adnaik <[email protected]>
Date:   2017-09-09T00:32:39Z

    [CARBONDATA-1174] Streaming Ingestion - schema validation and streaming examples

commit 61d283ef63faabdd97e90d0c5f6d862f073c5b2b
Author: Aniket Adnaik <[email protected]>
Date:   2017-09-10T00:54:03Z

    [CARBONDATA-1174] Streaming Ingestion - schema validation and streaming examples

commit 6e24d4fa1af90bd61a4c1bb5bf80321135761973
Author: Aniket Adnaik <[email protected]>
Date:   2017-09-12T01:59:48Z

    [CARBONDATA-1174] Streaming Ingestion - schema validation and streaming examples

commit 84fb1b76ce319841721db0ed8ef719b16d6c9acf
Author: Aniket Adnaik <[email protected]>
Date:   2017-09-12T08:53:07Z

    [CARBONDATA-1174] Streaming Ingestion - Schema validation and Examples

commit 97646ae45defa1d09bcefa04ddd0497e9238e8fa
Author: Aniket Adnaik <[email protected]>
Date:   2017-09-12T14:36:54Z

     [CARBONDATA-1174] Streaming Ingestion - Schema validation and Examples

----

