[ https://issues.apache.org/jira/browse/SOLR-12854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Amrit Sarkar updated SOLR-12854:
--------------------------------
    Description:
Delta imports in DataImportHandler are sometimes slower than full imports, because a delta import issues many more queries than a full import and therefore takes longer.
Listed in: https://wiki.apache.org/solr/DataImportHandlerDeltaQueryViaFullImport

On the mailing list, http://lucene.472066.n3.nabble.com/Number-of-requests-spike-up-when-i-do-the-delta-Import-td4338162.html, one of the Solr users has noted a workaround that reportedly works well and improves delta import performance: specify ${dataimporter.last_index_time} in the delta_import_query, and not in the delta_query.

{code}
I found a hacky way to limit the number of times deltaImportQuery was executed.

As designed, Solr executes deltaQuery to get a list of ids that need to be indexed. For each of those, it executes deltaImportQuery, which is typically very similar to the full query.

I constructed a deltaQuery to purposely only return 1 row. E.g.

deltaQuery = "SELECT id FROM table WHERE rownum=1" // written for Oracle, likely requires a different syntax for other DBs. Also, it occurred to me that you could probably include the date >= '${dataimporter.last_index_time}' filter here so this returns 0 rows if no data has changed.

Since deltaImportQuery now only gets called once, I needed to add the filter logic to deltaImportQuery to only select the changed rows (that logic is normally in deltaQuery). E.g.

deltaImportQuery = [normal import query] WHERE date >= '${dataimporter.last_index_time}'
{code}

A number of other users have adopted this strategy and seen DIH delta import performance improve, so documenting it as a TIP will help other users too.

  was:
Delta imports in DataImportHandler are sometimes slower than full imports, because a delta import issues many more queries than a full import and therefore takes longer.
Listed in: https://wiki.apache.org/solr/DataImportHandlerDeltaQueryViaFullImport

On the mailing list, http://lucene.472066.n3.nabble.com/Number-of-requests-spike-up-when-i-do-the-delta-Import-td4338162.html, one of the Solr users has noted a workaround that reportedly works well and improves delta import performance: specify ${dataimporter.last_index_time} in the delta_import_query, and not in the delta_sql_query.

{code}
I found a hacky way to limit the number of times deltaImportQuery was executed.

As designed, Solr executes deltaQuery to get a list of ids that need to be indexed. For each of those, it executes deltaImportQuery, which is typically very similar to the full query.

I constructed a deltaQuery to purposely only return 1 row. E.g.

deltaQuery = "SELECT id FROM table WHERE rownum=1" // written for Oracle, likely requires a different syntax for other DBs. Also, it occurred to me that you could probably include the date >= '${dataimporter.last_index_time}' filter here so this returns 0 rows if no data has changed.

Since deltaImportQuery now only gets called once, I needed to add the filter logic to deltaImportQuery to only select the changed rows (that logic is normally in deltaQuery). E.g.

deltaImportQuery = [normal import query] WHERE date >= '${dataimporter.last_index_time}'
{code}

A number of other users have adopted this strategy and seen DIH delta import performance improve, so documenting it as a TIP will help other users too.


> Document steps to improve delta import via DataImportHandler
> -------------------------------------------------------------
>
>                 Key: SOLR-12854
>                 URL: https://issues.apache.org/jira/browse/SOLR-12854
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public (Default Security Level. Issues are Public)
>          Components: contrib - DataImportHandler
>    Affects Versions: 7.5
>            Reporter: Amrit Sarkar
>            Priority: Major
>
> Delta imports in DataImportHandler are sometimes slower than full imports,
> because a delta import issues many more queries than a full import and
> therefore takes longer.
> Listed in:
> https://wiki.apache.org/solr/DataImportHandlerDeltaQueryViaFullImport
> On the mailing list,
> http://lucene.472066.n3.nabble.com/Number-of-requests-spike-up-when-i-do-the-delta-Import-td4338162.html,
> one of the Solr users has noted a workaround that reportedly works well and
> improves delta import performance: specify
> ${dataimporter.last_index_time} in the delta_import_query, and not in the
> delta_query.
> {code}
> I found a hacky way to limit the number of times deltaImportQuery was
> executed.
> As designed, Solr executes deltaQuery to get a list of ids that need to be
> indexed. For each of those, it executes deltaImportQuery, which is typically
> very similar to the full query.
> I constructed a deltaQuery to purposely only return 1 row. E.g.
> deltaQuery = "SELECT id FROM table WHERE rownum=1" // written for
> Oracle, likely requires a different syntax for other DBs. Also, it occurred
> to me that you could probably include the
> date >= '${dataimporter.last_index_time}' filter here so this returns 0 rows
> if no data has changed.
> Since deltaImportQuery now only gets called once, I needed to add the filter
> logic to deltaImportQuery to only select the changed rows (that logic is
> normally in deltaQuery). E.g.
> deltaImportQuery = [normal import query] WHERE date >=
> '${dataimporter.last_index_time}'
> {code}
> A number of other users have adopted this strategy and seen DIH delta import
> performance improve, so documenting it as a TIP will help other users too.
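
To make the workaround described above concrete, here is a minimal data-config.xml sketch of how the two attributes might look together. It is an illustration under stated assumptions, not a tested configuration: the table "item", its columns id, name and last_modified, and the entity name are all hypothetical, the dataSource element is omitted, and the rownum clause is the Oracle syntax from the mailing-list post (other databases would likely need something like LIMIT 1 instead).

```xml
<!-- Sketch only. Table "item" and columns id, name, last_modified are
     hypothetical; add your own <dataSource> element before <document>. -->
<dataConfig>
  <document>
    <!-- deltaQuery returns at most one row (zero if nothing changed),
         so deltaImportQuery is executed at most once per delta import.
         deltaImportQuery carries the last_index_time filter itself and
         does not filter on the id returned by deltaQuery. -->
    <entity name="item" pk="id"
            query="SELECT id, name FROM item"
            deltaQuery="SELECT id FROM item
                        WHERE last_modified &gt;= '${dataimporter.last_index_time}'
                          AND rownum = 1"
            deltaImportQuery="SELECT id, name FROM item
                              WHERE last_modified &gt;= '${dataimporter.last_index_time}'">
      <field column="id" name="id"/>
      <field column="name" name="name"/>
    </entity>
  </document>
</dataConfig>
```

With this shape, the per-id execution of deltaImportQuery that the post describes collapses into a single query that fetches every changed row. Whether the string comparison against last_modified works as written depends on the database and column type, so treat the WHERE clause as a placeholder for your own change-detection filter.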
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)