GitHub user aniketadnaik opened a pull request:
https://github.com/apache/carbondata/pull/1352
[CARBONDATA-1174] Streaming Ingestion - schema validation and streaming
examples
- Description:
This change mainly targets the "streaming_ingest" development branch.
The following changes are added on top of the previous framework changes (pr-1064):
1. Schema validation of input data when it comes from a file source and a
schema is specified. The source schema is validated against the existing table
schema. For a socket source, no schema validation is required since no schema
is attached to it.
2. Added streaming examples for file stream and socket stream sources:
CarbonStreamingIngestFileSourceExample.scala,
CarbonStreamingIngestSocketSourceExample.scala
These examples are added to facilitate development, making it easier to
understand and analyze the code flow. The examples will run end to end once
carbondata is able to write into the carbondata file format.
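The schema validation described in item 1 can be sketched roughly as follows. This is a minimal, self-contained Python illustration and not the code in this PR: `Field`, `SchemaMismatchError`, and `validate_source_schema` are hypothetical names, and the comparison rule (same column order, case-insensitive names, matching type names) is an assumption about what validating against the table schema involves.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Field:
    """Hypothetical column descriptor: a column name and a type name."""
    name: str
    data_type: str


class SchemaMismatchError(Exception):
    """Hypothetical error raised when source and table schemas differ."""


def validate_source_schema(source: Optional[List[Field]],
                           table: List[Field]) -> None:
    """Compare a source schema with the table schema, column by column.

    A source of None stands for a schemaless source (such as a socket),
    for which there is nothing to validate.
    """
    if source is None:
        return
    if len(source) != len(table):
        raise SchemaMismatchError(
            f"column count differs: source has {len(source)}, "
            f"table has {len(table)}")
    for position, (src, tbl) in enumerate(zip(source, table)):
        # Assumption: names are compared case-insensitively.
        if src.name.lower() != tbl.name.lower():
            raise SchemaMismatchError(
                f"column {position}: name '{src.name}' does not match "
                f"table column '{tbl.name}'")
        # Assumption: type names must be equal, ignoring case.
        if src.data_type.lower() != tbl.data_type.lower():
            raise SchemaMismatchError(
                f"column '{tbl.name}': type '{src.data_type}' does not "
                f"match table type '{tbl.data_type}'")
```

In this sketch the caller would run the check before creating any writer, so a mismatch surfaces as an exception before data is written.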
- Whether new unit test cases have been added or why no new tests are
required?
Yes, a new unit test for schema validation has been added.
- What manual testing you have done?
$> mvn clean -Pspark-2.1 -Dspark.version=2.1.0 verify
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Apache CarbonData :: Parent ........................ SUCCESS [ 1.320 s]
[INFO] Apache CarbonData :: Common ........................ SUCCESS [ 1.509 s]
[INFO] Apache CarbonData :: Core .......................... SUCCESS [ 26.109 s]
[INFO] Apache CarbonData :: Processing .................... SUCCESS [ 4.892 s]
[INFO] Apache CarbonData :: Hadoop ........................ SUCCESS [ 8.910 s]
[INFO] Apache CarbonData :: Spark Common .................. SUCCESS [ 13.876 s]
[INFO] Apache CarbonData :: Spark2 ........................ SUCCESS [02:29 min]
[INFO] Apache CarbonData :: Spark Common Test ............. SUCCESS [07:06 min]
[INFO] Apache CarbonData :: Assembly ...................... SUCCESS [ 1.724 s]
[INFO] Apache CarbonData :: Flink Examples ................ SUCCESS [ 2.480 s]
[INFO] Apache CarbonData :: Hive .......................... SUCCESS [ 4.776 s]
[INFO] Apache CarbonData :: presto ........................ SUCCESS [ 5.786 s]
[INFO] Apache CarbonData :: Spark2 Examples ............... SUCCESS [ 4.957 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 10:52 min
[INFO] Finished at: 2017-09-12T10:50:40-07:00
[INFO] Final Memory: 119M/1223M
[INFO] ------------------------------------------------------------------------
$> mvn clean verify
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Apache CarbonData :: Parent ........................ SUCCESS [ 6.925 s]
[INFO] Apache CarbonData :: Common ........................ SUCCESS [ 10.383 s]
[INFO] Apache CarbonData :: Core .......................... SUCCESS [02:07 min]
[INFO] Apache CarbonData :: Processing .................... SUCCESS [ 21.376 s]
[INFO] Apache CarbonData :: Hadoop ........................ SUCCESS [ 18.568 s]
[INFO] Apache CarbonData :: Spark Common .................. SUCCESS [01:03 min]
[INFO] Apache CarbonData :: Spark ......................... SUCCESS [04:34 min]
[INFO] Apache CarbonData :: Spark Common Test ............. SUCCESS [24:33 min]
[INFO] Apache CarbonData :: Assembly ...................... SUCCESS [ 8.661 s]
[INFO] Apache CarbonData :: Spark Examples ................ SUCCESS [ 22.520 s]
[INFO] Apache CarbonData :: Flink Examples ................ SUCCESS [ 6.592 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 33:55 min
[INFO] Finished at: 2017-09-12T08:12:30-07:00
[INFO] Final Memory: 62M/298M
[INFO] ------------------------------------------------------------------------
* Made sure write path class invocation and schema validation happen
correctly with Spark Structured Streaming (2.1) and a parquet file source
* Made sure the write path execution workflow works with Structured
Streaming (2.1) for both socket and file sources
- Any additional information to help reviewers in testing this change.
For an invalid schema, carbondata throws an exception and no record writer is
instantiated. This is a first level of validation of input streaming data at
the CarbonSource entry point; another level of input data validation happens
in the carbon load path anyway.
Some file sources allow the schema to be inferred if
"spark.sql.streaming.schemaInference" is set to true and no explicit schema
is specified. In that case we validate against the inferred schema. Carbondata
also provides inferSchema functionality if a table path is provided. The
inferSchema() functionality is used in the read path (readStream) and will be
applicable when the read path functionality is implemented.
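The choice of which schema gets validated can be sketched as below. This is an illustrative Python sketch under stated assumptions, not the PR's implementation: `schema_to_validate` and the `infer` callback are hypothetical, the configuration is modeled as a plain dictionary of strings, and an unset "spark.sql.streaming.schemaInference" is treated as false.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A schema is modeled here as (column name, type name) pairs.
Schema = List[Tuple[str, str]]

SCHEMA_INFERENCE_KEY = "spark.sql.streaming.schemaInference"


def schema_to_validate(explicit: Optional[Schema],
                       conf: Dict[str, str],
                       infer: Callable[[], Schema]) -> Optional[Schema]:
    """Pick the source schema to validate against the table schema.

    An explicit schema always wins. Without one, the inferred schema is
    used only when schema inference is enabled. Returning None means
    there is no source schema to validate.
    """
    if explicit is not None:
        return explicit
    # Assumption: an unset key is treated as "false".
    if conf.get(SCHEMA_INFERENCE_KEY, "false").lower() == "true":
        # `infer` is a hypothetical callback standing in for whatever
        # inference the file source performs.
        return infer()
    return None
```

Passing inference as a callback keeps the potentially expensive step lazy, so it only runs when no explicit schema was given and inference is enabled.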
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/aniketadnaik/carbondataStreamIngest streamIngest-1174
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/carbondata/pull/1352.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1352
----
commit ac4f9c2a3bd0c7e7569bde9ce797abcb424222a4
Author: Aniket Adnaik <[email protected]>
Date: 2017-09-08T00:28:00Z
[CARBONDATA-1174] Streaming Ingestion - Schema validation and Examples
commit 8e710b8b5265cc1b3db52deecfae2086cb46993b
Author: Aniket Adnaik <[email protected]>
Date: 2017-09-09T00:32:02Z
[CARBONDATA-1174] Streaming Ingestion - schema validation and streaming
examples
commit 991d12aa0ec8ec58a5763f28ef6260c668b1f1c4
Author: Aniket Adnaik <[email protected]>
Date: 2017-09-09T00:32:39Z
[CARBONDATA-1174] Streaming Ingestion - schema validation and streaming
examples
commit 61d283ef63faabdd97e90d0c5f6d862f073c5b2b
Author: Aniket Adnaik <[email protected]>
Date: 2017-09-10T00:54:03Z
[CARBONDATA-1174] Streaming Ingestion - schema validation and streaming
examples
commit 6e24d4fa1af90bd61a4c1bb5bf80321135761973
Author: Aniket Adnaik <[email protected]>
Date: 2017-09-12T01:59:48Z
[CARBONDATA-1174] Streaming Ingestion - schema validation and streaming
examples
commit 84fb1b76ce319841721db0ed8ef719b16d6c9acf
Author: Aniket Adnaik <[email protected]>
Date: 2017-09-12T08:53:07Z
[CARBONDATA-1174] Streaming Ingestion - Schema validation and Examples
commit 97646ae45defa1d09bcefa04ddd0497e9238e8fa
Author: Aniket Adnaik <[email protected]>
Date: 2017-09-12T14:36:54Z
[CARBONDATA-1174] Streaming Ingestion - Schema validation and Examples
----
---