a2l007 edited a comment on pull request #9449: URL: https://github.com/apache/druid/pull/9449#issuecomment-636069152
@2bethere Thanks for taking a look.

> If the SQL table has a timestamp like column, is there a way for me to specify this as a parameter so that not the entire table gets pulled?

You can use WHERE clauses within your SQL queries to restrict the data based on your requirements. It is recommended to filter the SQL queries based on the intervals specified in the granularity spec, so as to avoid handling unwanted data.

> Is there a way for me to specify which column to split this on? Because the user might already know how the table is sharded/partitioned to make it more efficient in parallel ingestion

There isn't a direct way to split the input data based on a column, as this `InputSource` splits the work into sub-tasks based on the number of SQL queries. One way you could spread the data across sub-tasks is to introduce pagination within your SQL queries.

> If incremental loads are supported, how are duplicates handled? Do I specify a key or this is handled downstream?

This `InputSource` is no different from any other native batch `InputSource` type in terms of handling updates. Therefore, any change in your source db for a specific interval would require you to ingest the entire data for that interval again, and the new segments will replace the existing segments for that interval.

Hope that helps.
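To illustrate the two points above (interval filtering via WHERE clauses, and splitting across sub-tasks via multiple queries), here is a minimal sketch of the relevant `inputSource` portion of a parallel ingestion spec. The table name `events`, the timestamp column `ts`, the `id`-based split predicate, and the connection details are all hypothetical placeholders, not values from this PR:

```json
{
  "type": "sql",
  "database": {
    "type": "mysql",
    "connectorConfig": {
      "connectURI": "jdbc:mysql://some-host:3306/some_schema",
      "user": "admin",
      "password": "secret"
    }
  },
  "sqls": [
    "SELECT * FROM events WHERE ts >= '2020-01-01' AND ts < '2020-01-02' AND MOD(id, 2) = 0",
    "SELECT * FROM events WHERE ts >= '2020-01-01' AND ts < '2020-01-02' AND MOD(id, 2) = 1"
  ]
}
```

Each entry in `sqls` becomes its own sub-task, so the `MOD(id, 2)` predicates here play the role of manual pagination: the user decides how to partition the table, and the WHERE clause on `ts` matches the interval declared in the granularity spec so that no unwanted rows are pulled.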
