[jira] [Issue Comment Edited] (SOLR-2382) DIH Cache Improvements

2011-11-28 Thread Mikhail Khludnev (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158689#comment-13158689
 ] 

Mikhail Khludnev edited comment on SOLR-2382 at 11/28/11 7:52 PM:
--

James,

Please find my proof that where="xid=x.id" is not supported: 
TestCachedSqlEntityProcessor.java-break-where-clause.patch. It looks puzzling, 
sorry for that. The test was green only because it relied on the order of keys 
in the map. Wrapping the map in a sorted map breaks that order and leads to 
picking up the wrong primary key column. Please find the explanation below.

From my point of view the worst part is 
[lines:27-28|http://svn.apache.org/viewvc/lucene/dev/trunk/solr/contrib/dataimporthandler/src/java/org/apache/solr/handler/dataimport/SortedMapBackedCache.java?view=markup]:
 it just picks up the first key from the map as the primary key when the key 
wasn't properly detected from the attributes. This condition hides the problem 
until you face it and address it.
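
To illustrate why a "first key wins" fallback depends on the map implementation, here is a minimal sketch. guessPrimaryKey() is a hypothetical stand-in written for this example, not the actual code in SortedMapBackedCache; it only mimics the idea of taking the first key of a row.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

final class FirstKeyFallbackDemo {
  /** Hypothetical fallback: treat the first key of the row as the primary key. */
  static String guessPrimaryKey(Map<String, Object> row) {
    return row.keySet().iterator().next();
  }

  public static void main(String[] args) {
    // Insertion order puts "id" first, so the guess happens to be right.
    Map<String, Object> insertionOrdered = new LinkedHashMap<>();
    insertionOrdered.put("id", 1);
    insertionOrdered.put("desc", "one");

    // Same content in a sorted map: "desc" sorts before "id".
    Map<String, Object> sorted = new TreeMap<>(insertionOrdered);

    System.out.println(guessPrimaryKey(insertionOrdered)); // id
    System.out.println(guessPrimaryKey(sorted));           // desc
  }
}
```

The rows are equal as maps, yet the guessed key differs, which is how a test can stay green by accident.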

The left part of the where clause isn't used [here at lines 
45-48|http://svn.apache.org/viewvc/lucene/dev/trunk/solr/contrib/dataimporthandler/src/java/org/apache/solr/handler/dataimport/DIHCacheSupport.java?view=markup],
 and where="" is ignored again [at lines 
185-190|http://svn.apache.org/viewvc/lucene/dev/trunk/solr/contrib/dataimporthandler/src/java/org/apache/solr/handler/dataimport/SortedMapBackedCache.java?view=markup].

You can see that the second attachment, 
TestCachedSqlEntityProcessor.java-fix-where-clause-by-adding-cachePk-and-lookup.patch,
 fixes the test by adding cachePk and cacheLookup to the attributes.
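
For reference, this is roughly what honouring both sides of where="xid=x.id" could look like. It is a sketch under my own assumptions: WhereClauseKeys and parse() are hypothetical names, and the real DIHCacheSupport may split the clause differently.

```java
final class WhereClauseKeys {
  final String cachePk;     // left side, e.g. "xid"
  final String cacheLookup; // right side, e.g. "x.id"

  private WhereClauseKeys(String cachePk, String cacheLookup) {
    this.cachePk = cachePk;
    this.cacheLookup = cacheLookup;
  }

  /** Splits "pk=lookup" at the first '='; returns null if either side is missing. */
  static WhereClauseKeys parse(String where) {
    if (where == null) {
      return null;
    }
    int eq = where.indexOf('=');
    if (eq < 0) {
      return null;
    }
    String pk = where.substring(0, eq).trim();
    String lookup = where.substring(eq + 1).trim();
    if (pk.isEmpty() || lookup.isEmpty()) {
      return null;
    }
    return new WhereClauseKeys(pk, lookup);
  }

  public static void main(String[] args) {
    WhereClauseKeys keys = parse("xid=x.id");
    System.out.println(keys.cachePk + " / " + keys.cacheLookup); // xid / x.id
  }
}
```

Returning null for a malformed clause lets the caller fail loudly instead of silently guessing a key.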

My proposals are:
* fix it. It's not a big deal to bring the where attribute back
* but why are the new attributes cachePk and cacheLookup better than the old 
where attribute? Depending on the reply I vote for
** decommissioning where="" or for 
** rolling the new cachePk/cacheLookup attributes back
* can't we add more randomization to 
AbstractDataImportHandlerTestCase.createMap(Object...) to find more hidden 
issues like this one? I propose to choose the concrete map behaviour randomly: 
hash, sorted, sorted-reverse. WDYT?
* the names withWhereClause() and withKeyAndLookup() should be swapped. Their 
content contradicts [the 
names|http://svn.apache.org/viewvc/lucene/dev/trunk/solr/contrib/dataimporthandler/src/test/org/apache/solr/handler/dataimport/TestCachedSqlEntityProcessor.java?view=markup]
{code}
  public void withWhereClause() {
...
"query", q, DIHCacheSupport.CACHE_PRIMARY_KEY,"id", 
DIHCacheSupport.CACHE_FOREIGN_KEY ,"
...
  public void withKeyAndLookup() {
...
Map entityAttrs = createMap("query", q, "where", "id=x.id",
...
{code}  
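
The randomization proposal above could be sketched like this. RandomizedMapFactory is a hypothetical helper written for illustration, not the existing createMap() in AbstractDataImportHandlerTestCase; taking the Random as a parameter is my assumption, so that a failing seed can be replayed.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

final class RandomizedMapFactory {
  /** Builds a map from key/value pairs, picking hash, sorted or reverse-sorted order at random. */
  static Map<String, Object> createMap(Random random, Object... keyVals) {
    if (keyVals.length % 2 != 0) {
      throw new IllegalArgumentException("expected key/value pairs");
    }
    Map<String, Object> map;
    switch (random.nextInt(3)) {
      case 0:
        map = new HashMap<>();
        break;
      case 1:
        map = new TreeMap<>();
        break;
      default:
        map = new TreeMap<>(Collections.<String>reverseOrder());
        break;
    }
    for (int i = 0; i < keyVals.length; i += 2) {
      map.put(keyVals[i].toString(), keyVals[i + 1]);
    }
    return map;
  }

  public static void main(String[] args) {
    // The content is always the same; only the iteration order may differ per seed.
    System.out.println(createMap(new Random(), "query", "q", "where", "id=x.id"));
  }
}
```

Code that depends on key order will then fail on some seeds instead of passing by accident.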



[jira] [Issue Comment Edited] (SOLR-2382) DIH Cache Improvements

2011-11-28 Thread Mikhail Khludnev (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158623#comment-13158623
 ] 

Mikhail Khludnev edited comment on SOLR-2382 at 11/28/11 6:47 PM:
--

James,

Please find my replies below.
{quote}
Currently none of the existing unit tests fail with this change and I'm not 
sure exactly how to reproduce the problem you're describing. Could you create a 
failing unit test for this to clarify what you're experiencing?
{quote}
I did it yesterday: I attached TestThreaded.java.patch against trunk r1144761.
 Can you see it in the attachments?
 After you apply it, both of the new tests fail. The simplest one, 
testCachedSingleThread_FullImport(), can be fixed by disabling the cache 
destruction, i.e. by commenting out DocBuilder.java line 473: 
// entityProcessor.destroy();
I believe it should be properly covered by a separate issue.

{quote}
 Once again, a failing unit test would be helpful in knowing how to reproduce 
the specific problem you've found.
{quote}
testCachedSingleThread_FullImport() fails again if you remove cachePk="xid" 
cacheLookup="x.id" from the test config constant at line 109. I briefly looked 
into TestCachedSqlEntityProcessor, and it seems I've found where it is not 
accurate enough. Let me attach my findings soon. 

{quote}
But if its happening in 3.4 then its not related to SOLR-2382, which is in 
Trunk/4.0 only
{quote}
The problem is the same as in 3.x. My test TestThreaded.java.patch has 
testCached_*Multi*_Thread_FullImport(), which covers this particular race 
condition. You can see it after making testCached_*Single*_Thread_FullImport() 
green somehow (see issue no.1). I can't spawn it as a new issue because of 
blocker no.1.





  
> DIH Cache Improvements
> --
>
> Key: SOLR-2382
> URL: https://issues.apache.org/jira/browse/SOLR-2382
> Project: Solr
>  Issue Type: New Feature
>  Components: contrib - DataImportHandler
>Reporter: James Dyer
>Priority: Minor
> Attachments: SOLR-2382-dihwriter.patch, SOLR-2382-dihwriter.patch, 
> SOLR-2382-dihwriter.patch, SOLR-2382-dihwriter.patch, 
> SOLR-2382-dihwriter.patch, SOLR-2382-entities.patch, 
> SOLR-2382-entities.patch, SOLR-2382-entities.patch, SOLR-2382-entities.patch, 
> SOLR-2382-entities.patch, SOLR-2382-entities.patch, SOLR-2382-entities.patch, 
> SOLR-2382-entities.patch, SOLR-2382-properties.patch, 
> SOLR-2382-properties.patch, SOLR-2382-solrwriter-verbose-fix.patch, 
> SOLR-2382-solrwriter.patch, SOLR-2382-solrwriter.patch, 
> SOLR-2382-solrwriter.patch, SOLR-2382.patch, SOLR-2382.patch, 
> SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, 
> SOLR-2382.patch, SOLR-2382.patch, TestThreaded.java.patch
>
>
> Functionality:
>  1. Provide a pluggable caching framework for DIH so that users can choose a 
> cache implementation that best suits their data and application.
>  
>  2. Provide a means to temporarily cache a child Entity's data without 
> needing to create a special cached implementation of the Entity Processor 
> (such as CachedSqlEntityProcessor).
>  
>  3. Provide a means to write the final (root entity) DIH output to a cache 
> rather than to Solr.  Then provide a way for a subsequent DIH call to use the 
> cache as an Entity input.  Also pr

[jira] [Issue Comment Edited] (SOLR-2382) DIH Cache Improvements

2011-11-27 Thread Mikhail Khludnev (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13157979#comment-13157979
 ] 

Mikhail Khludnev edited comment on SOLR-2382 at 11/27/11 7:47 PM:
--

Hello,

I want to contribute a test for the Parent/Child use case with 
CachedSqlEntityProcessor in single- and multi-threaded modes: 
TestThreaded.java.patch on r1144761.
Please let me know how you feel about it.

Actually, I explored this case on 3.4 some time ago, but decided to wait a 
little until this refactoring made some progress.

* the first issue is the testCachedSingleThread_FullImport() failure. It's 
caused by 
{code:title=DocBuilder.java line 473}
    } finally {
      entityProcessor.destroy();
    }
{code}
This code, which cleans up the cache, makes sense, but for parent entities 
only. It causes a failure during child entity enumeration, when run() is 
called from line 510. It shouldn't be a big deal to fix. 

* then, a minor complaint: it looks like where="xid=x.id" is not supported by 
the new code, which relies on cachePk="xid" and cacheLookup="x.id". For me 
it's a matter of opinion, backward compatibility and documentation. 

* the most interesting problem is the failure of 
testCachedMultiThread_FullImport(). On 3.4 it's caused by concurrent access to 
the child entities' iteration state. Now it looks like 
DIHCacheSupport.dataSourceRowCache is accessed by multiple threads from 
ThreadedEntityProcessorWrapper. I have some ideas, but I'd like to know your 
opinion.
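
One possible direction for the multi-threaded case, shown only as a sketch: keep the cached rows read-only and give every thread its own iterator, so no iteration state is shared. PerThreadIterationDemo and its methods are hypothetical names for this example; it does not use DIHCacheSupport or ThreadedEntityProcessorWrapper.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class PerThreadIterationDemo {
  /** Walks the rows with an iterator that belongs to the calling thread only. */
  static int countRows(List<String> rows) {
    int count = 0;
    Iterator<String> it = rows.iterator();
    while (it.hasNext()) {
      it.next();
      count++;
    }
    return count;
  }

  /** Runs countRows() on several threads over the same, unmodified list. */
  static List<Integer> countInParallel(List<String> rows, int threads) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      Callable<Integer> task = () -> countRows(rows);
      List<Future<Integer>> futures = new ArrayList<>();
      for (int i = 0; i < threads; i++) {
        futures.add(pool.submit(task));
      }
      List<Integer> counts = new ArrayList<>();
      for (Future<Integer> future : futures) {
        counts.add(future.get());
      }
      return counts;
    } catch (InterruptedException | ExecutionException e) {
      throw new IllegalStateException(e);
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) {
    System.out.println(countInParallel(Arrays.asList("row1", "row2", "row3"), 4)); // [3, 3, 3, 3]
  }
}
```

Every thread sees all rows because the cursor lives in the iterator rather than in the shared cache object. This assumes nobody modifies the list while it is being read.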

Guys, I can handle some of these, let me know how I can help.

--
Mikhail


[jira] [Issue Comment Edited] (SOLR-2382) DIH Cache Improvements

2011-10-12 Thread Noble Paul (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13125693#comment-13125693
 ] 

Noble Paul edited comment on SOLR-2382 at 10/12/11 9:54 AM:


The DIHCache interface should not have the following methods
{code:java}
/**
 * Get the next document in the cache or NULL if the last record has been reached.
 * Use this method to efficiently iterate through the entire cache.
 *
 * @return the next cached row, or null after the last record
 */
public Map<String,Object> getNext();

/**
 * Reset the cache's internal iterator so that a subsequent call to getNext()
 * will return the first cached row
 */
public void resetNext();

{code}
Instead, we should have these methods
{code:java}
Iterator<Map<String,Object>> getIterator();

Iterator<Map<String,Object>> getIterator(String key);
{code}



This means all the state maintained for iteration will go away.

If we can do that, the following fields in SortedMapBackedCache become redundant
{code:java}
private Iterator<Map.Entry<Object, List<Map<String,Object>>>> theMapIter = null;
private Object currentKey = null;
private List<Map<String,Object>> currentKeyResult = null;
private Iterator<Map<String,Object>> currentKeyResultIter = null;

{code}
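
A minimal sketch of what an iterator-based cache could look like. RowCache and SortedMapRowCache are hypothetical names for this example and are not the DIHCache or SortedMapBackedCache from the patch; only the two getIterator() signatures follow the proposal above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

/** Hypothetical cache contract: callers own the iterators, the cache keeps no cursor. */
interface RowCache {
  void add(String key, Map<String, Object> row);

  Iterator<Map<String, Object>> getIterator();

  Iterator<Map<String, Object>> getIterator(String key);
}

final class SortedMapRowCache implements RowCache {
  private final SortedMap<String, List<Map<String, Object>>> rowsByKey = new TreeMap<>();

  @Override
  public void add(String key, Map<String, Object> row) {
    rowsByKey.computeIfAbsent(key, k -> new ArrayList<>()).add(row);
  }

  /** Iterates over a snapshot of all rows, in key order. */
  @Override
  public Iterator<Map<String, Object>> getIterator() {
    List<Map<String, Object>> all = new ArrayList<>();
    for (List<Map<String, Object>> rows : rowsByKey.values()) {
      all.addAll(rows);
    }
    return all.iterator();
  }

  /** Iterates over the rows stored under one key; empty if the key is unknown. */
  @Override
  public Iterator<Map<String, Object>> getIterator(String key) {
    List<Map<String, Object>> rows = rowsByKey.get(key);
    if (rows == null) {
      return Collections.emptyIterator();
    }
    return rows.iterator();
  }
}
```

With this shape the cache needs no fields such as currentKey or currentKeyResultIter, because each caller's position lives in the iterator it was handed.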


[jira] [Issue Comment Edited] (SOLR-2382) DIH Cache Improvements

2011-10-11 Thread Pulkit Singhal (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13125376#comment-13125376
 ] 

Pulkit Singhal edited comment on SOLR-2382 at 10/11/11 8:49 PM:


James,

I'm trying to figure out how the data-config.xml would be written to support 
this case:
{code}
We needed a flexible & scalable way to temporarily cache child-entity data 
prior to joining to parent entities.
{code}
1) Can you please provide a sample?

2) Also, my guess is that testing this feature would require applying both 
entities.patch and dihwriter.patch. Correct? I am guessing so because I think 
dihwriter.patch has the code changes for data to be written to and consumed 
from a cache. In my use case there is one huge fixed-position text file of 
child-entity data, which has:
{code}
parentDataID_1 someData1
parentDataID_1 someData2
parentDataID_2 someData3
parentDataID_2 someData4
{code}
So I just want to make sure I'm getting the right set of patches and 
understanding the application of the use case properly. Please advise.
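
To make the intended join concrete, here is a sketch of what caching the child rows by parent ID amounts to. It is not a data-config.xml sample and does not use any DIH class; ChildRowGrouping is a hypothetical helper, and splitting on whitespace is an assumption made for this example (a real fixed-position file would be cut by column offsets).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

final class ChildRowGrouping {
  /** Groups "parentId data" lines by parent ID so each parent can look up its children once. */
  static SortedMap<String, List<String>> groupByParentId(List<String> lines) {
    SortedMap<String, List<String>> byParent = new TreeMap<>();
    for (String line : lines) {
      String trimmed = line.trim();
      if (trimmed.isEmpty()) {
        continue; // skip blank lines
      }
      String[] parts = trimmed.split("\\s+", 2);
      String data = parts.length > 1 ? parts[1] : "";
      byParent.computeIfAbsent(parts[0], k -> new ArrayList<>()).add(data);
    }
    return byParent;
  }

  public static void main(String[] args) {
    List<String> lines = Arrays.asList(
        "parentDataID_1 someData1",
        "parentDataID_1 someData2",
        "parentDataID_2 someData3",
        "parentDataID_2 someData4");
    System.out.println(groupByParentId(lines));
    // {parentDataID_1=[someData1, someData2], parentDataID_2=[someData3, someData4]}
  }
}
```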


[jira] [Issue Comment Edited] (SOLR-2382) DIH Cache Improvements

2011-06-23 Thread Noble Paul (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054263#comment-13054263
 ] 

Noble Paul edited comment on SOLR-2382 at 6/24/11 6:42 AM:
---

bq. cacheInit() in EntityProcessorBase specifically passes only the parameters 
that apply to the current situation

It doesn't matter. It can use any params which are relevant to it. Anyway, you 
can't define what params are required for a future DIHCache impl. Look at a 
Transformer implementation: it can read anything it wants. The cache should be 
initialized the same way.

Why should the DocBuilder even be aware of DIHCache? Should it not be kept 
local to the EntityProcessor?
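
A small sketch of the idea that a cache implementation picks whatever parameters it cares about from the entity attributes, instead of being handed a pre-filtered set. ConfigurableCache, DirectoryBackedCacheStub and the attribute names cacheBaseDir and cacheReadOnly are all hypothetical; they are not part of the patch.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical contract: the cache receives all entity attributes and reads what it needs. */
interface ConfigurableCache {
  void open(Map<String, String> entityAttributes);
}

final class DirectoryBackedCacheStub implements ConfigurableCache {
  private String baseDir;
  private boolean readOnly;

  @Override
  public void open(Map<String, String> entityAttributes) {
    // Only this implementation knows it needs these two (made-up) attributes.
    String dir = entityAttributes.get("cacheBaseDir");
    baseDir = dir != null ? dir : System.getProperty("java.io.tmpdir");
    readOnly = Boolean.parseBoolean(entityAttributes.get("cacheReadOnly"));
  }

  String baseDir() {
    return baseDir;
  }

  boolean isReadOnly() {
    return readOnly;
  }

  public static void main(String[] args) {
    Map<String, String> attrs = new HashMap<>();
    attrs.put("query", "select * from x"); // ignored by this cache
    attrs.put("cacheBaseDir", "/data/dih-cache");
    attrs.put("cacheReadOnly", "true");

    DirectoryBackedCacheStub cache = new DirectoryBackedCacheStub();
    cache.open(attrs);
    System.out.println(cache.baseDir() + " readOnly=" + cache.isReadOnly());
    // /data/dih-cache readOnly=true
  }
}
```

The caller does not need to know which attributes a future implementation will want; unrelated attributes are simply ignored.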

> DIH Cache Improvements
> --
>
> Key: SOLR-2382
> URL: https://issues.apache.org/jira/browse/SOLR-2382
> Project: Solr
>  Issue Type: New Feature
>  Components: contrib - DataImportHandler
>Reporter: James Dyer
>Priority: Minor
> Attachments: SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, 
> SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch
>
>
> Functionality:
>  1. Provide a pluggable caching framework for DIH so that users can choose a 
> cache implementation that best suits their data and application.
>  
>  2. Provide a means to temporarily cache a child Entity's data without 
> needing to create a special cached implementation of the Entity Processor 
> (such as CachedSqlEntityProcessor).
>  
>  3. Provide a means to write the final (root entity) DIH output to a cache 
> rather than to Solr.  Then provide a way for a subsequent DIH call to use the 
> cache as an Entity input.  Also provide the ability to do delta updates on 
> such persistent caches.
>  
>  4. Provide the ability to partition data across multiple caches that can 
> then be fed back into DIH and indexed either to varying Solr Shards, or to 
> the same Core in parallel.
> Use Cases:
>  1. We needed a flexible & scalable way to temporarily cache child-entity 
> data prior to joining to parent entities.
>   - Using SqlEntityProcessor with Child Entities can cause an "n+1 select" 
> problem.
>   - CachedSqlEntityProcessor only supports an in-memory HashMap as a Caching 
> mechanism and does not scale.
>   - There is no way to cache non-SQL inputs (ex: flat files, xml, etc).
>  
>  2. We needed the ability to gather data from long-running entities by a 
> process that runs separate from our main indexing process.
>   
>  3. We wanted the ability to do a delta import of only the entities that 
> changed.
>   - Lucene/Solr requires entire documents to be re-indexed, even if only a 
> few fields changed.
>   - Our data comes from 50+ complex sql queries and/or flat files.
>   - We do not want to incur overhead re-gathering all of this data if only 1 
> entity's data changed.
>   - Persistent DIH caches solve this problem.
>   
>  4. We want the ability to index several documents in parallel (using 1.4.1, 
> which did not have the "threads" parameter).
>  
>  5. In the future, we may need to use Shards, creating a need to easily 
> partition our source data into Shards.
> Implementation Details:
>  1. De-couple EntityProcessorBase from caching.  
>   - Created a new interface, DIHCache & two implementations:  
> - SortedMapBackedCache - An in-memory cache, used as default with 
> CachedSqlEntityProcessor (now deprecated).
> - BerkleyBackedCache - A disk-backed cache, dependent on bdb-je, tested 
> with je-4.1.6.jar
>- NOTE: the existing Lucene Contrib "db" project uses je-3.3.93.jar.  
> I believe this may be incompatible due to Generic Usage.
>- NOTE: I did not modify the ant script to automatically get this jar, 
> so to use or evaluate this patch, download bdb-je from 
> http://www.oracle.com/technetwork/database/berkeleydb/downloads/index.html 
>  
>  2. Allow Entity Processors to take a "cacheImpl" parameter to cause the 
> entity data to be cached (see EntityProcessorBase & DIHCacheProperties).
>  
>  3. Partially De-couple SolrWriter from DocBuilder
>   - Created a new interface DIHWriter, & two implementations:
>- SolrWriter (refactored)
>- DIHCacheWriter (allows DIH to write ultimately to a Cache).
>
>  4. Create a new Entity Processor, DIHCacheProcessor, which reads a 
> persistent Cache as DIH Entity Input.
>  
>  5. Support a "partition" parameter with both DIHCacheWriter and 
> DIHCacheP