Re: [jira] [Commented] (SOLR-4787) Join Contrib

Erick Erickson Mon, 14 Dec 2015 08:40:58 -0800

bq: reader.terms(field) would ever return null if the field name is
defined in solrconfig.xml


I'm a little out of my depth here, but assuming you mean schema.xml above:
just because a field is defined there doesn't mean it's ever actually used
in any of the index structures. That only happens once at least one document
adds a value for that field. Lucene doesn't know that schema.xml exists at
all; that's a convention imposed by Solr. Could it be that no document in
your corpus has any term for that particular field?
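
To make that concrete, here is a toy model in plain Java. It is not Lucene
code: ToyIndex, addValue, terms and termCount are made-up names. The only
point is that a per-field lookup has nothing to return until some document
supplies a value, so the caller needs a null check.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

// Toy stand-in for an index reader. ToyIndex and its methods are
// hypothetical names for illustration, not Lucene or Solr APIs.
class ToyIndex {
    // Per-field term dictionary, created lazily when a document adds a value.
    private final Map<String, TreeSet<String>> termsByField = new HashMap<>();

    void addValue(String field, String value) {
        termsByField.computeIfAbsent(field, f -> new TreeSet<>()).add(value);
    }

    // Returns null when no document has ever added a value for the field,
    // regardless of whether a schema elsewhere declares it.
    TreeSet<String> terms(String field) {
        return termsByField.get(field);
    }

    // Null-safe helper: treat a missing field as "zero terms".
    int termCount(String field) {
        TreeSet<String> terms = terms(field);
        return terms == null ? 0 : terms.size();
    }
}
```

Whether "zero terms" is the right fallback inside UnInvertedLongField is a
separate question; the sketch only shows where the null comes from.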

Best,
Erick

On Mon, Dec 14, 2015 at 8:24 AM, Marcus Bergner (JIRA) <[email protected]> wrote:
>
>     [ https://issues.apache.org/jira/browse/SOLR-4787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15056220#comment-15056220 ]
>
> Marcus Bergner commented on SOLR-4787:
> --------------------------------------
>
> I've been trying out the patches in this ticket for a while (with Solr 4.9.1), 
> most recently the patches by [~krantiparisa] that use 
> UnInvertedLongField to handle multi-value "to"/"from" in a hjoin query. After 
> stumbling on a couple of issues I think I have something that works 
> reasonably well. The last thing that had me puzzled was that, after indexing 
> slightly more data than a very trivial test set, I was getting a 
> NullPointerException in the UnInvertedLongField constructor:
>
> {noformat}
> Terms terms = reader.terms(field);  // returns null
> ...
> TermsEnum termsEnum = terms.iterator(null);   // throws NullPointerException
> {noformat}
>
> I've added various null checks in the code, but I don't really understand 
> how/why reader.terms(field) would ever return null if the field name is 
> defined in solrconfig.xml? Also, how bad would it be to simply have empty 
> arrays in UnInvertedLongField and ignore any docId values >= length of the 
> internal arrays if reader.terms(field) for some reason does return null? My 
> tests so far look promising with such a patch but it gives me a somewhat bad 
> feeling.
>
> This was my tiny test hierarchy that worked; the ids were large integers (19 
> digits).
>
> ||ID||Members||Comment||
> |id1|id2|Level A|
> |id2|id3, id4|Level B|
> |id3| |Level C|
> |id4| |Level C|
> |id5| |Level C, no parent, added later after initial tests|
>
> Using the above documents 1-4 a \{!hjoin fromIndex=coll from=memberslvlB 
> to=idlvlC\}... worked and swapping to/from for the reversed search also 
> worked. Adding an additional document to the index (id5) that was not a 
> member of level B in this case caused the same queries to fail with 
> NullPointerException in the UnInvertedLongField constructor. Note that each 
> "level" in the hierarchy here has its own field names, both for its id 
> field and its member list field (basically "$type.id" and "$type.member"). I 
> first thought it could be related to dynamic fields, but after changing my 
> indexing and Solr schema to use real fields I could see the same problem.
>
>
>> Join Contrib
>> ------------
>>
>>                 Key: SOLR-4787
>>                 URL: https://issues.apache.org/jira/browse/SOLR-4787
>>             Project: Solr
>>          Issue Type: New Feature
>>          Components: search
>>    Affects Versions: 4.2.1
>>            Reporter: Joel Bernstein
>>            Priority: Minor
>>             Fix For: Trunk
>>
>>         Attachments: SOLR-4787-deadlock-fix.patch, 
>> SOLR-4787-pjoin-long-keys.patch, SOLR-4787-with-testcase-fix.patch, 
>> SOLR-4787.patch, SOLR-4787.patch, SOLR-4787.patch, SOLR-4787.patch, 
>> SOLR-4787.patch, SOLR-4787.patch, SOLR-4787.patch, SOLR-4787.patch, 
>> SOLR-4787.patch, SOLR-4787.patch, SOLR-4787.patch, SOLR-4787.patch, 
>> SOLR-4787.patch, SOLR-4787.patch, SOLR-4787.patch, 
>> SOLR-4797-hjoin-multivaluekeys-nestedJoins.patch, 
>> SOLR-4797-hjoin-multivaluekeys-trunk.patch
>>
>>
>> This contrib provides a place where different join implementations can be 
>> contributed to Solr. This contrib currently includes 3 join implementations. 
>> The initial patch was generated from the Solr 4.3 tag. Because of changes in 
>> the FieldCache API this patch will only build with Solr 4.2 or above.
>> *HashSetJoinQParserPlugin aka hjoin*
>> The hjoin provides a join implementation that filters results in one core 
>> based on the results of a search in another core. This is similar in 
>> functionality to the JoinQParserPlugin but the implementation differs in a 
>> couple of important ways.
>> The first way is that the hjoin is designed to work with int and long join 
>> keys only. So, in order to use hjoin, int or long join keys must be included 
>> in both the to and from core.
>> The second difference is that the hjoin builds memory structures that are 
>> used to quickly connect the join keys. So, the hjoin will need more memory 
>> than the JoinQParserPlugin to perform the join.
>> The main advantage of the hjoin is that it can scale to join millions of 
>> keys between cores and provide sub-second response time. The hjoin should 
>> work well with up to two million results from the fromIndex and tens of 
>> millions of results from the main query.
>> The hjoin supports the following features:
>> 1) Both lucene query and PostFilter implementations. A *"cost"* > 99 will 
>> turn on the PostFilter. The PostFilter will typically outperform the Lucene 
>> query when the main query results have been narrowed down.
>> 2) With the lucene query implementation there is an option to build the 
>> filter with threads. This can greatly improve the performance of the query 
>> if the main query index is very large. The "threads" parameter turns on 
>> threading. For example *threads=6* will use 6 threads to build the filter. 
>> This will set up a fixed threadpool with six threads to handle all hjoin 
>> requests. Once the threadpool is created the hjoin will always use it to 
>> build the filter. Threading does not come into play with the PostFilter.
>> 3) The *size* local parameter can be used to set the initial size of the 
>> hashset used to perform the join. If this is set above the number of results 
>> from the fromIndex then you can avoid hashset resizing, which improves 
>> performance.
>> 4) Nested filter queries. The local parameter "fq" can be used to nest a 
>> filter query within the join. The nested fq will filter the results of the 
>> join query. This can point to another join to support nested joins.
>> 5) Full caching support for the lucene query implementation. The filterCache 
>> and queryResultCache should work properly even with deep nesting of joins. 
>> Only the queryResultCache comes into play with the PostFilter implementation 
>> because PostFilters are not cacheable in the filterCache.
>> The syntax of the hjoin is similar to the JoinQParserPlugin except that the 
>> plugin is referenced by the string "hjoin" rather than "join".
>> fq=\{!hjoin fromIndex=collection2 from=id_i to=id_i threads=6 
>> fq=$qq\}user:customer1&qq=group:5
>> The example filter query above will search the fromIndex (collection2) for 
>> "user:customer1" applying the local fq parameter to filter the results. The 
>> lucene filter query will be built using 6 threads. This query will generate 
>> a list of values from the "from" field that will be used to filter the main 
>> query. Only records from the main query, where the "to" field is present in 
>> the "from" list will be included in the results.
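
The filtering described above can be sketched in a few lines of plain Java.
This is an illustration under assumptions, not the patch's code:
HashJoinSketch is a hypothetical name, java.util.HashSet stands in for
whatever set the hjoin really uses, and the arrays stand in for the join keys
read from each core. The size argument plays the role of the *size* local
parameter.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustration only: HashJoinSketch is a hypothetical name, and
// java.util.HashSet is a stand-in for the set used by the real hjoin.
class HashJoinSketch {
    // Returns the positions of main-query docs whose "to" key occurs
    // among the "from" keys collected from the other core.
    static List<Integer> join(long[] fromKeys, long[] toKeys, int size) {
        // Presize so that `size` entries fit under the default 0.75 load
        // factor without triggering a rehash.
        Set<Long> from = new HashSet<>((int) (size / 0.75f) + 1);
        for (long key : fromKeys) {
            from.add(key);
        }
        List<Integer> matches = new ArrayList<>();
        for (int doc = 0; doc < toKeys.length; doc++) {
            if (from.contains(toKeys[doc])) {
                matches.add(doc);
            }
        }
        return matches;
    }
}
```

For example, joining from keys {5, 7} against to keys {1, 5, 7, 9, 5} keeps
the docs at positions 1, 2 and 4.
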
>> The solrconfig.xml in the main query core must contain the reference to the 
>> hjoin.
>> <queryParser name="hjoin" 
>> class="org.apache.solr.joins.HashSetJoinQParserPlugin"/>
>> And the join contrib lib jars must be registered in the solrconfig.xml.
>>  <lib dir="../../../contrib/joins/lib" regex=".*\.jar" />
>> After issuing the "ant dist" command from inside the solr directory the 
>> joins contrib jar will appear in the solr/dist directory. Place the 
>> solr-joins-4.*-.jar  in the WEB-INF/lib directory of the solr 
>> webapplication. This will ensure that the top level Solr classloader loads 
>> these classes rather than the core's classloader.
>> *BitSetJoinQParserPlugin aka bjoin*
>> The bjoin behaves exactly like the hjoin but uses a BitSet instead of a 
>> HashSet to perform the underlying join. Because of this the bjoin is much 
>> faster and can provide sub-second response times on result sets of tens of 
>> millions of records from the fromIndex and hundreds of millions of records 
>> from the main query.
>> But there are limitations to how the bjoin can be used. The bjoin treats the 
>> join keys as addresses in a BitSet and uses the Lucene OpenBitSet 
>> implementation which performs very well but is not sparse. So the BitSet 
>> memory is dictated by the size of the join keys. For example a bitset with a 
>> max join key of 200,000,000 will need 25 MB of memory. For this reason the 
>> BitSet join does not support long join keys. In order to keep memory usage 
>> down the join keys should also be packed at the low end, for example from 1 
>> to 50,000,000.
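
The memory figure above is just the maximum key divided by eight. A small
sketch of the arithmetic, assuming a dense bit set stored in 64-bit words
(BitSetMemory and bytesFor are hypothetical names, not part of the patch):

```java
// Illustration only: BitSetMemory is a hypothetical helper, not patch code.
class BitSetMemory {
    // Bytes needed by a dense (non-sparse) bit set that can address
    // keys 0..maxKey, assuming storage in 64-bit words.
    static long bytesFor(long maxKey) {
        long words = (maxKey >>> 6) + 1; // one 64-bit word covers 64 keys
        return words * 8L;
    }
}
```

bytesFor(200_000_000L) comes to 25,000,008 bytes, which matches the 25 MB
quoted above. A long key such as 2^40 would already need about 128 GiB,
which is why packing the keys at the low end matters.
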
>> Below is a sample bjoin:
>> fq=\{!bjoin fromIndex=collection2 from=id_i to=id_i threads=6 
>> fq=$qq\}user:customer1&qq=group:5
>> To register the bjoin the solrconfig.xml in the main query core must contain 
>> the reference to the bjoin.
>> <queryParser name="bjoin" 
>> class="org.apache.solr.joins.BitSetJoinQParserPlugin"/>
>> *ValueSourceJoinParserPlugin aka vjoin*
>> The second implementation is the ValueSourceJoinParserPlugin aka "vjoin". 
>> This implements a ValueSource function query that can return a value from a 
>> second core based on join keys and limiting query. The limiting query can be 
>> used to select a specific subset of data from the join core. This allows 
>> customer specific relevance data to be stored in a separate core and then 
>> joined in the main query.
>> The vjoin is called using the "vjoin" function query. For example:
>> bf=vjoin(fromCore, fromKey, fromVal, toKey, query)
>> This example shows "vjoin" being called by the edismax boost function 
>> parameter. This example will return the "fromVal" from the "fromCore". The 
>> "fromKey" and "toKey" are used to link the records from the main query to 
>> the records in the "fromCore". The "query" is used to select a specific set 
>> of records to join with in fromCore.
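
Conceptually the vjoin is a keyed lookup: build a map from fromKey to fromVal
over the records that match the limiting query, then look up each main-query
document's toKey. A minimal sketch, assuming long keys and float values;
ValueJoinSketch, valueFor and the default-value argument are hypothetical and
say nothing about how the plugin's actual memory structures are laid out.

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only: ValueJoinSketch is a hypothetical name; the real
// plugin's memory structures are not described in this issue.
class ValueJoinSketch {
    private final Map<Long, Float> valueByKey = new HashMap<>();

    // fromKeys[i] and fromVals[i] describe the i-th fromCore record that
    // matched the limiting query.
    ValueJoinSketch(long[] fromKeys, float[] fromVals) {
        for (int i = 0; i < fromKeys.length; i++) {
            valueByKey.put(fromKeys[i], fromVals[i]);
        }
    }

    // Value for a main-query doc with the given toKey. Returning a caller
    // supplied default when nothing joins is an assumption of this sketch.
    float valueFor(long toKey, float defaultVal) {
        Float v = valueByKey.get(toKey);
        return v == null ? defaultVal : v;
    }
}
```
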
>> Currently the fromKey and toKey must be longs but this will change in future 
>> versions. Like the pjoin, the "join" SolrCache is used to hold the join 
>> memory structures.
>> To configure the vjoin you must register the ValueSource plugin in the 
>> solrconfig.xml as follows:
>> <valueSourceParser name="vjoin" 
>> class="org.apache.solr.joins.ValueSourceJoinParserPlugin" />
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
