[GitHub] [druid] jihoonson commented on a change in pull request #9449: Add Sql InputSource

GitBox Thu, 07 May 2020 15:50:25 -0700


jihoonson commented on a change in pull request #9449:
URL: https://github.com/apache/druid/pull/9449#discussion_r421836358




##########
File path: docs/ingestion/native-batch.md
##########
@@ -1310,6 +1311,43 @@ A spec that applies a filter and reads a subset of the original datasource's col
 This spec above will only return the `page`, `user` dimensions and `added` metric.
 Only rows where `page` = `Druid` will be returned.
 
+### Sql Input Source
+
+The SQL input source is used to read data directly from RDBMS.
+The SQL input source is _splittable_ and can be used by the [Parallel task](#parallel-task), where each worker task will read from one SQL query from the list of queries.
+Since this input source has a fixed input format for reading events, no `inputFormat` field needs to be specified in the ingestion spec when using this input source.
+
+|property|description|required?|
+|--------|-----------|---------|
+|type|This should be "sql".|Yes|
+|database|Specifies the database connection details.|Yes|

Review comment:
       Would you add more detailed docs for this parameter? It should probably 
mention that you have to load some extension to read from a particular type of 
database.
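
As a rough illustration of the level of detail being asked for, the `database` property might be documented with an example along these lines. This is only a sketch: the `connectorConfig`, `connectURI`, and `sqls` field names and the `mysql-metadata-storage` extension name are assumptions that do not appear in the diff above and should be checked against the PR, and `build_sql_input_source` is a hypothetical helper used only to show the shape of the spec.

```python
import json


def build_sql_input_source(db_type, connect_uri, user, password, sqls):
    """Build a dict shaped like an `sql` input source (field names are assumed)."""
    return {
        "type": "sql",
        "database": {
            # Assumption: the database type selects a connector that is only
            # available once the matching extension (for example
            # mysql-metadata-storage) has been loaded.
            "type": db_type,
            "connectorConfig": {
                "connectURI": connect_uri,
                "user": user,
                "password": password,
            },
        },
        # One query per entry; each worker task reads from one query.
        "sqls": list(sqls),
    }


example_source = build_sql_input_source(
    "mysql",
    "jdbc:mysql://host:3306/schema",  # placeholder URI
    "user",
    "password",
    ["SELECT * FROM table1"],
)
print(json.dumps(example_source, indent=2))
```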

##########
File path: docs/ingestion/native-batch.md
##########
@@ -1310,6 +1311,43 @@ A spec that applies a filter and reads a subset of the original datasource's col
 This spec above will only return the `page`, `user` dimensions and `added` metric.
 Only rows where `page` = `Druid` will be returned.
 
+### Sql Input Source

Review comment:
       One more thing: I remember that many people in our community have been 
asking how to use `SqlFirehose`. What do you think about adding a section 
that explains how to use it in a production environment? To be honest, it's not 
clear to me what the best practices are for building a scalable and efficient 
pipeline with this input source. For example, how do you parallelize each 
ingestion task (that is, how do you split the queries)? How do you handle data 
updates in the database after ingestion? How often do you run ingestion jobs? 
And so on.
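
To make the splitting question concrete, one possible approach (not necessarily the recommended practice) is to partition a time range into fixed-size windows and emit one query per window, so that each worker task of the parallel task reads a disjoint slice. The sketch below assumes the table has a timestamp column; `split_queries`, the table name, and the column name are hypothetical.

```python
from datetime import datetime, timedelta

TS_FORMAT = "%Y-%m-%d %H:%M:%S"


def split_queries(table, ts_column, start, end, step):
    """Return one SELECT per half-open window [lower, upper) covering [start, end)."""
    queries = []
    lower = start
    while lower < end:
        # Clamp the last window so it never runs past `end`.
        upper = min(lower + step, end)
        queries.append(
            "SELECT * FROM {t} WHERE {c} >= '{lo}' AND {c} < '{hi}'".format(
                t=table,
                c=ts_column,
                lo=lower.strftime(TS_FORMAT),
                hi=upper.strftime(TS_FORMAT),
            )
        )
        lower = upper
    return queries


# Example: split one day into four six-hour queries.
for query in split_queries(
    "events", "ts", datetime(2020, 1, 1), datetime(2020, 1, 2), timedelta(hours=6)
):
    print(query)
```

Half-open windows keep a row that falls exactly on a boundary from being read by two queries. This does not address rows that are updated in the database after ingestion.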






