a2l007 edited a comment on pull request #9449:
URL: https://github.com/apache/druid/pull/9449#issuecomment-636069152


   @2bethere 
   Thanks for taking a look.
   > If the SQL table has a timestamp like column, is there a way for me to 
specify this as a parameter so that not the entire table gets pulled?
   
   You could use a WHERE clause within your SQL query to restrict the data based 
on your requirements. It is recommended to filter your SQL queries based on the 
intervals specified in the granularitySpec, so as to avoid pulling in unwanted 
data.
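   For example, if the granularitySpec intervals cover January 2020, the query 
could be scoped accordingly (table and column names here are just placeholders):
   
   ```sql
   SELECT ts, dim1, metric1
   FROM some_table
   WHERE ts >= TIMESTAMP '2020-01-01 00:00:00'
     AND ts <  TIMESTAMP '2020-02-01 00:00:00'
   ```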
   
   > Is there a way for me to specify which column to split this on? Because 
the user might already know how the table is sharded/partitioned to make it 
more efficient in parallel ingestion
   
   There isn't a direct way to split the input data based on a column, as this 
InputSource splits the task into sub-tasks based on the number of SQL 
queries. One way you could spread the data across sub-tasks would be to 
introduce pagination within your SQL queries, so that each query reads a 
disjoint slice of the table.
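   As a rough sketch, the inputSource could be given one query per slice, with 
each entry becoming its own split and hence its own sub-task (connector details, 
table, and column names below are hypothetical):
   
   ```json
   "inputSource": {
     "type": "sql",
     "database": {
       "type": "mysql",
       "connectorConfig": {
         "connectURI": "jdbc:mysql://some-host:3306/some_schema",
         "user": "user",
         "password": "password"
       }
     },
     "sqls": [
       "SELECT ts, dim1, metric1 FROM some_table WHERE id >= 0 AND id < 1000000",
       "SELECT ts, dim1, metric1 FROM some_table WHERE id >= 1000000 AND id < 2000000"
     ]
   }
   ```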
   
   > If incremental loads are supported, how are duplicates handled? Do I 
specify a key or this is handled downstream?
   
   This InputSource is no different from any other native batch `InputSource` type 
in terms of handling updates. Therefore, any changes in your source db for a 
specific interval would require you to re-ingest the entire data for that interval, 
and the newly created segments will replace the existing segments for that interval. 
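   So if, say, rows for 2020-01-15 changed in the source table, you would re-run 
the ingestion with a query scoped to just that interval and the resulting segments 
would replace the old ones (names are placeholders again):
   
   ```sql
   SELECT ts, dim1, metric1
   FROM some_table
   WHERE ts >= TIMESTAMP '2020-01-15 00:00:00'
     AND ts <  TIMESTAMP '2020-01-16 00:00:00'
   ```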
   Hope that helps.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


