Distributed result set merging in Solr
Currently, distributed requests are initiated entirely by whichever node receives the query, correct? That is, as far as I know, shards don't talk to each other or send requests back to the controller.

I'm looking at sending stats facets between shards to speed up merging. Rather than having one node responsible for merging the facet sets from every shard, each facet set would be partitioned by term (A-D, E-G, etc.), and each shard would merge one partition of each facet set. However, that kind of communication doesn't really fit into Solr's current model of distributed processing.

I don't think my use case is the only one where letting shards talk among themselves could help performance, so I'm curious why nothing in Solr does it. Is it deliberate? Has no one bothered? Or am I wrong, and nothing else has a non-trivial reduce step?

Steve McKay | Software Developer | GCE
Re: Distributed result set merging in Solr
On Dec 18, 2012, at 6:50 PM, Yonik Seeley yo...@lucidworks.com wrote:

On Tue, Dec 18, 2012 at 6:28 PM, Steve McKay steve.mc...@gcecloud.com wrote:

I'm looking at sending stats facets between shards to speed up merging. Rather than have one node responsible for merging the facet sets from every shard, each facet set is partitioned by term and then each shard merges one partition of each facet set. A-D, E-G, etc.

Could you give a concrete example of what you're thinking (say 3 shards and just a few terms?)

Take three shards and the field spending_category, which has 6 terms: C, D, G, I, L, O. Currently, when a stats request is faceted on spending_category, the controller receives results from each shard with all 6 facets present and merges them together. What I'm talking about is having each shard partition its result into {C, D}, {G, I}, {L, O}. Shards 2 and 3 then send facets C and D to shard 1 for merging, and likewise for the other shards. The result each shard sends back to the controller is then independent of the other shards' results, and merging is trivial.

In that example, merging doesn't take significant time either way. What motivates this is doing top-k operations on facet sets of large cardinality, e.g. 1 million unique elements, with 200,000 elements being returned by each of 6 shards. Currently, with all the merging done on the controller, a top-10 query spends most of its time merging shard results. Distributing the merge step should significantly improve that.

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
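The partitioned merge described in the message above can be sketched roughly as follows. This is a single-process Python simulation, not Solr code: the function names, the split points, and the reduced stats record (count, sum, min, max) are all assumptions made for illustration, and the "sending" between shards is modelled as appending to an in-memory inbox.

```python
import bisect
import heapq


def stats_of(*values):
    """Build a reduced stats record (hypothetical shape, not Solr's)."""
    return {"count": len(values), "sum": sum(values),
            "min": min(values), "max": max(values)}


def merge_stats(a, b):
    """Combine two stats records for the same facet term."""
    return {"count": a["count"] + b["count"], "sum": a["sum"] + b["sum"],
            "min": min(a["min"], b["min"]), "max": max(a["max"], b["max"])}


def owner(term, splits):
    """Index of the shard that merges `term`; `splits` are sorted lower bounds."""
    return bisect.bisect_right(splits, term)


def partitioned_merge(shard_results, splits):
    """Each shard routes its facets to the owning shard, which merges them."""
    inbox = [[] for _ in range(len(splits) + 1)]
    for facets in shard_results:              # every shard partitions its result
        for term, stats in facets.items():
            inbox[owner(term, splits)].append((term, stats))
    merged = []
    for entries in inbox:                     # every owner merges one partition
        part = {}
        for term, stats in entries:
            part[term] = merge_stats(part[term], stats) if term in part else stats
        merged.append(part)
    return merged


def top_k_by_sum(merged_partitions, k):
    """Partitions are disjoint by term, so local top-k lists suffice."""
    candidates = []
    for part in merged_partitions:
        candidates.extend(
            heapq.nlargest(k, part.items(), key=lambda kv: kv[1]["sum"]))
    return heapq.nlargest(k, candidates, key=lambda kv: kv[1]["sum"])


shards = [
    {"C": stats_of(1, 2), "G": stats_of(10), "L": stats_of(3)},
    {"C": stats_of(5), "D": stats_of(7), "O": stats_of(4, 4)},
    {"D": stats_of(1), "G": stats_of(2), "I": stats_of(11), "O": stats_of(1)},
]
parts = partitioned_merge(shards, splits=["G", "L"])
top2 = top_k_by_sum(parts, 2)
```

Because each term is owned by exactly one shard, the controller in this sketch only has to combine k candidates per shard instead of every facet from every shard.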
Re: Hierarchical stats for Solr
Sure can, thanks!

On Dec 18, 2012, at 8:12 PM, Ryan McKinley ryan...@gmail.com wrote:

Hi Steve-

The work you discuss sounds interesting, can you make a JIRA issue for this?

See: http://wiki.apache.org/solr/HowToContribute#JIRA_tips_.28our_issue.2BAC8-bug_tracker.29

thanks
ryan

On Tue, Dec 18, 2012 at 3:09 PM, Steve McKay steve.mc...@gcecloud.com wrote:

e.g. facet by vendor and then facet each vendor by year. I've also added stats.sort, stats.limit, and stats.offset field params. stats.sort syntax is sum|min|max|stdDev|average|sumOfSquares|count|missing|value:asc|desc and limit and offset work as in SQL.

Faceting will generally use more RAM and be faster than the 4.0 baseline. I've changed more than some might consider to be strictly necessary; this is because a large part of my effort has been to make faceting performant under adverse conditions, with large result sets and faceting on fields with large (1m+) cardinalities. If there's interest I can post some rough response time numbers for faceting on fields with various cardinalities.
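To make "facet by vendor and then facet each vendor by year" concrete, here is a minimal Python sketch of hierarchical stats over in-memory documents. It does not reflect the patch's implementation; the document fields (vendor, year, amount), the function names, and the nested result layout are assumptions chosen for illustration.

```python
def new_stats():
    """Empty stats accumulator (hypothetical shape)."""
    return {"count": 0, "sum": 0.0, "min": None, "max": None}


def add_value(stats, value):
    """Fold one field value into an accumulator."""
    stats["count"] += 1
    stats["sum"] += value
    stats["min"] = value if stats["min"] is None else min(stats["min"], value)
    stats["max"] = value if stats["max"] is None else max(stats["max"], value)


def new_node():
    return {"stats": new_stats(), "facets": {}}


def hierarchical_stats(docs, facet_fields, stats_field):
    """Compute stats at the root and at every level of the facet hierarchy."""
    root = new_node()
    for doc in docs:
        node = root
        add_value(node["stats"], doc[stats_field])
        for field in facet_fields:            # descend vendor -> year -> ...
            node = node["facets"].setdefault(doc[field], new_node())
            add_value(node["stats"], doc[stats_field])
    return root


docs = [
    {"vendor": "Acme", "year": 2011, "amount": 100.0},
    {"vendor": "Acme", "year": 2011, "amount": 50.0},
    {"vendor": "Acme", "year": 2012, "amount": 25.0},
    {"vendor": "Bolt", "year": 2012, "amount": 10.0},
]
tree = hierarchical_stats(docs, ["vendor", "year"], "amount")
acme_2011 = tree["facets"]["Acme"]["facets"][2011]["stats"]
```

Each document touches one node per level, so the memory cost grows with the number of distinct (vendor, year) combinations rather than with the number of documents.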
[jira] [Created] (SOLR-4214) Hierarchical stats
Steve McKay created SOLR-4214:

Summary: Hierarchical stats
Key: SOLR-4214
URL: https://issues.apache.org/jira/browse/SOLR-4214
Project: Solr
Issue Type: New Feature
Components: SearchComponents - other
Reporter: Steve McKay

Hierarchical stats faceting, e.g. facet by vendor and then facet each vendor by year.

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (SOLR-4214) Hierarchical stats
[ https://issues.apache.org/jira/browse/SOLR-4214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve McKay updated SOLR-4214:

Attachment: stats.patch

The primary new feature is hierarchical faceting, e.g. facet by vendor and then facet each vendor by year. I've also added stats.sort, stats.limit, and stats.offset field params. stats.sort syntax is sum|min|max|stdDev|average|sumOfSquares|count|missing|value:asc|desc and limit and offset work as in SQL.

Faceting will generally use more RAM and be faster than the 4.0 baseline. I've changed more than some might consider to be strictly necessary; this is because a large part of my effort has been to make faceting performant under adverse conditions, with large result sets and faceting on fields with large (1m+) cardinalities.

One caveat: distributed stats are broken in this patch due to other work in progress.

Tests pass, although I changed a few test cases relating to what happens when stats.field is completely absent in the result set. The existing behavior is to return null as the stats result; my code returns zeroed-out stats, which IMO is more felicitous anyway.

The attached patch is diffed from branches/lucene_solr_4_0.
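The stats.sort, stats.limit, and stats.offset behaviour described above might look like the following Python sketch. It assumes the sort spec reads as a stat name followed by ":asc" or ":desc", and that "value" means sorting by the facet term itself; both are interpretations of the syntax string in the message, not confirmed details of the patch. The function names are hypothetical.

```python
SORT_KEYS = {"sum", "min", "max", "stdDev", "average",
             "sumOfSquares", "count", "missing", "value"}


def parse_sort(spec):
    """Split an assumed '<stat>:<asc|desc>' spec into (key, descending)."""
    key, _, direction = spec.partition(":")
    if key not in SORT_KEYS or direction not in ("asc", "desc"):
        raise ValueError("unsupported sort spec: %r" % spec)
    return key, direction == "desc"


def page_facets(facets, sort, limit=None, offset=0):
    """Order facets by the sort spec, then apply SQL-style OFFSET and LIMIT."""
    key, descending = parse_sort(sort)

    def sort_key(item):
        term, stats = item
        # Assumption: 'value' orders by the facet term, anything else by a stat.
        return term if key == "value" else stats[key]

    ordered = sorted(facets.items(), key=sort_key, reverse=descending)
    end = None if limit is None else offset + limit
    return ordered[offset:end]


facets = {
    "Acme": {"sum": 175.0, "count": 3},
    "Bolt": {"sum": 10.0, "count": 1},
    "Cog": {"sum": 60.0, "count": 2},
}
top_two = page_facets(facets, "sum:desc", limit=2)
second_page = page_facets(facets, "sum:desc", limit=1, offset=1)
```

Sorting the full facet set before slicing is the simplest correct approach; for very large cardinalities a bounded heap of size offset + limit would avoid the full sort.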
[jira] [Created] (LUCENE-4528) Slf4JInfoStream - sends InfoStream messages to SLF4J
Steve McKay created LUCENE-4528:

Summary: Slf4JInfoStream - sends InfoStream messages to SLF4J
Key: LUCENE-4528
URL: https://issues.apache.org/jira/browse/LUCENE-4528
Project: Lucene - Core
Issue Type: Improvement
Components: core/other
Reporter: Steve McKay
Priority: Minor

InfoStream doesn't play well with logging. With Slf4JInfoStream, users can send InfoStream messages to the logging library of their choice for processing. Hooray!
[jira] [Updated] (LUCENE-4528) Slf4JInfoStream - sends InfoStream messages to SLF4J
[ https://issues.apache.org/jira/browse/LUCENE-4528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve McKay updated LUCENE-4528:

Attachment: LUCENE-4528.patch
[jira] [Commented] (LUCENE-4528) Slf4JInfoStream - sends InfoStream messages to SLF4J
[ https://issues.apache.org/jira/browse/LUCENE-4528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13489792#comment-13489792 ]

Steve McKay commented on LUCENE-4528:

You're right--I'm not accustomed to thinking about this from upstream's POV. As a user I'd get 80% of the benefit from having this available in Solr, which already depends on SLF4J, so maybe it would be better for me to resubmit this as a config option in Solr? lucene/misc also has no external deps at the moment.
Re: Invalid XML Output (Arabic): int name=0
Are you sure this isn't an Eclipse issue? The plaintext (quoted-printable) source of the int elements looks like this:

    <int name=3D"=D8=A7=D9=84=D9=85=D8=B3=D8=AA=D8=B4=D9=81=D9=89 =D9=88=D9=82=D8=A7=D9=84=D8=AA =D9=8A=D8=B1=D9=88=D8=AD">0</int>

AFAIK this is valid XML. There are only a few codepoints not allowed by XML 1.0. I don't see a right-to-left mark, so it seems like Mail.app is trying to infer a right-to-left mark for the Arabic text and getting it wrong, causing the mangled display.

On Aug 29, 2012, at 2:17 PM, Fuad Efendi f...@efendi.ca wrote:

Hi all,

It looks like we have very special command character here… which mirrors some visible images… but it is still invalid XML when I try to validate in Eclipse…

Solr-4.0.0-BETA

    <?xml version="1.0" encoding="UTF-8"?>
    <response>
      <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">237</int>
        <lst name="params">
          <str name="facet">true</str>
          <str name="facet.offset">1</str>
          <str name="facet.sort">index</str>
          <str name="facet.limit">10</str>
          <str name="facet.field">enrich_keywords_string_mv</str>
        </lst>
      </lst>
      <result name="response" numFound="0" start="0">
      </result>
      <lst name="facet_counts">
        <lst name="facet_queries"/>
        <lst name="facet_fields">
          <lst name="enrich_keywords_string_mv">
            <int name="المستشفى وقالت يروح">0</int>
            <int name="المستشفى وقالو لي">0</int>
            <int name="المستشفى وقالوا خلاص">0</int>
            <int name="المستشفى وقالوا عندك">0</int>
            <int name="المستشفى وقالوا لا">0</int>
            <int name="المستشفى وقالوا لابو">0</int>
            <int name="المستشفى وقالوا لهم">0</int>
            <int name="المستشفى وقالوا لي">0</int>
            <int name="المستشفى وقالى تعالى">0</int>
            <int name="المستشفى وقام بعمل">0</int>
          </lst>
        </lst>
        <lst name="facet_dates"/>
        <lst name="facet_ranges"/>
      </lst>
    </response>
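The claim in the reply above (the bytes decode to ordinary Arabic letters, all permitted by XML 1.0, with no directional mark present) can be checked mechanically. The following Python sketch uses only the standard library; the helper names are hypothetical, and the set of bidirectional control characters it looks for is an assumption about which marks are relevant.

```python
import quopri
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

# Quoted-printable bytes of the first facet name, copied from the reply.
QP_NAME = (b"=D8=A7=D9=84=D9=85=D8=B3=D8=AA=D8=B4=D9=81=D9=89 "
           b"=D9=88=D9=82=D8=A7=D9=84=D8=AA =D9=8A=D8=B1=D9=88=D8=AD")

# LRM, RLM, and the embedding/override controls (assumed set of interest).
BIDI_CONTROLS = {0x200E, 0x200F, 0x202A, 0x202B, 0x202C, 0x202D, 0x202E}


def is_xml10_char(cp):
    """True if the code point matches the XML 1.0 'Char' production."""
    return (cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)


def check_facet_name(qp_bytes):
    """Decode the name, round-trip it through an XML parser, inspect chars."""
    name = quopri.decodestring(qp_bytes).decode("utf-8")
    element = ET.fromstring("<int name=%s>0</int>" % quoteattr(name))
    code_points = [ord(ch) for ch in name]
    return {
        "name": element.get("name"),
        "all_chars_allowed": all(is_xml10_char(cp) for cp in code_points),
        "has_bidi_control": any(cp in BIDI_CONTROLS for cp in code_points),
    }


result = check_facet_name(QP_NAME)
```

If the parser accepts the element and every code point passes `is_xml10_char`, the validation error is more likely to come from the tool or the display layer than from the response itself.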