Distributed result set merging in Solr

2012-12-18 Thread Steve McKay
Currently distributed requests are entirely initiated by whichever node 
receives a query, correct? That is, as far as I know shards don't talk to each 
other or send requests back to the controller.

I'm looking at sending stats facets between shards to speed up merging. Rather 
than have one node responsible for merging the facet sets from every shard, 
each facet set is partitioned by term and then each shard merges one partition 
of each facet set. A-D, E-G, etc. However, that kind of communication doesn't 
really fit into Solr's current model of distributed processing. I think my use 
case isn't the only instance where it could help performance for shards to talk 
amongst themselves, so I'm curious why nothing in Solr does. Is it deliberate? 
Did no one bother? Or am I wrong, and nothing else has a non-trivial reduce step?


Steve McKay | Software Developer | GCE
steve.mc...@gcecloud.com | (703) 390-3044 desk 
| (703) 659-0608 Skype | (443) 710-2762 mobile





Re: Distributed result set merging in Solr

2012-12-18 Thread Steve McKay
On Dec 18, 2012, at 6:50 PM, Yonik Seeley <yo...@lucidworks.com> wrote:

> On Tue, Dec 18, 2012 at 6:28 PM, Steve McKay <steve.mc...@gcecloud.com> wrote:
>> I'm looking at sending stats facets between shards to speed up merging.
>> Rather than have one node responsible for merging the facet sets from every
>> shard, each facet set is partitioned by term and then each shard merges one
>> partition of each facet set. A-D, E-G, etc.
>
> Could you give a concrete example of what you're thinking (say 3
> shards and just a few terms?)

Take three shards and the field spending_category, which has 6 terms: C, D, G, 
I, L, O. Currently, when a stats request is faceted on spending_category, the 
controller receives results from each shard with all 6 facets present and 
merges them. What I'm talking about is having each shard partition its result 
into {C, D}, {G, I}, {L, O}. Shards 2 and 3 then send facets C and D to shard 1 
for merging, and likewise for the other partitions. The result each shard sends 
back to the controller is then independent of the other shard results, and 
merging is trivial.
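
A minimal Python sketch of that exchange, simulated in a single process. The 
per-shard counts and sums are invented for illustration, and owner_of, 
exchange, merge_partition, and controller_merge are hypothetical names, not 
Solr APIs. Only count and sum are merged here, since both are additive; other 
stats would need their own merge rules.

```python
from collections import defaultdict

# Hypothetical per-shard stats facets on spending_category: term -> (count, sum).
SHARD_RESULTS = [
    {"C": (2, 10.0), "D": (1, 4.0), "G": (3, 9.0), "I": (1, 1.0), "L": (2, 6.0), "O": (1, 5.0)},
    {"C": (1, 3.0), "D": (2, 8.0), "G": (1, 2.0), "I": (4, 8.0), "L": (1, 1.0), "O": (2, 2.0)},
    {"C": (3, 6.0), "D": (1, 1.0), "G": (2, 4.0), "I": (1, 3.0), "L": (3, 9.0), "O": (1, 1.0)},
]

# Term partitions from the example; partition i is merged by shard i.
PARTITIONS = [{"C", "D"}, {"G", "I"}, {"L", "O"}]


def owner_of(term):
    """Return the index of the shard responsible for merging this term."""
    for index, terms in enumerate(PARTITIONS):
        if term in terms:
            return index
    raise KeyError(term)


def exchange(shard_results):
    """Simulate the shard-to-shard step: route every facet to its owning shard."""
    inboxes = [defaultdict(list) for _ in PARTITIONS]
    for result in shard_results:
        for term, stats in result.items():
            inboxes[owner_of(term)][term].append(stats)
    return inboxes


def merge_partition(inbox):
    """Merge all partial (count, sum) pairs an owning shard received."""
    return {
        term: (sum(c for c, _ in parts), sum(s for _, s in parts))
        for term, parts in inbox.items()
    }


def controller_merge(partials):
    """Partitions are disjoint by term, so the controller only concatenates."""
    merged = {}
    for partial in partials:
        merged.update(partial)
    return merged


def distributed_merge(shard_results):
    return controller_merge([merge_partition(inbox) for inbox in exchange(shard_results)])


if __name__ == "__main__":
    for term, stats in sorted(distributed_merge(SHARD_RESULTS).items()):
        print(term, stats)
```

The controller step reduces to a dictionary update because no term appears in 
more than one partition.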

In that example, merging doesn't take significant time either way. What 
motivates this is doing top-k operations on facet sets of large cardinality, 
e.g. 1 million unique elements, 200,000 elements being returned by each of 6 
shards. Currently, doing all the merging on the controller, a top-10 query 
spends most of its time merging shard results. Distributing the merge step 
should significantly improve that.
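
A rough sketch of why top-k benefits, again in Python and with hypothetical 
names (shard_top_k, controller_top_k) and made-up sums. It assumes each shard 
already holds a fully merged, disjoint partition of the facet set, so the 
global top-k must be contained in the union of the per-partition top-k lists.

```python
import heapq


def shard_top_k(merged_partition, k):
    """Top-k (term, sum) pairs of one fully merged partition, largest sum first."""
    return heapq.nlargest(k, merged_partition.items(), key=lambda kv: kv[1])


def controller_top_k(per_shard_top_k, k):
    """Pick the global top-k from at most k candidates per shard."""
    candidates = [pair for top in per_shard_top_k for pair in top]
    return heapq.nlargest(k, candidates, key=lambda kv: kv[1])


if __name__ == "__main__":
    # Hypothetical merged partitions: term -> sum.
    partitions = [
        {"C": 19.0, "D": 13.0},
        {"G": 15.0, "I": 12.0},
        {"L": 16.0, "O": 8.0},
    ]
    k = 2
    tops = [shard_top_k(p, k) for p in partitions]
    print(controller_top_k(tops, k))
```

Under these assumptions the controller examines at most k candidates per shard 
(60 for a top-10 query over 6 shards) instead of every element each shard 
returns.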


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Hierarchical stats for Solr

2012-12-18 Thread Steve McKay
Sure can, thanks!

On Dec 18, 2012, at 8:12 PM, Ryan McKinley <ryan...@gmail.com> wrote:

Hi Steve-

The work you discuss sounds interesting, can you make a JIRA issue for this?

See:
http://wiki.apache.org/solr/HowToContribute#JIRA_tips_.28our_issue.2BAC8-bug_tracker.29

thanks
ryan


On Tue, Dec 18, 2012 at 3:09 PM, Steve McKay <steve.mc...@gcecloud.com> wrote:
e.g. facet by vendor and then facet each vendor by year. I've also added 
stats.sort, stats.limit, and stats.offset field params. stats.sort syntax is 
sum|min|max|stdDev|average|sumOfSquares|count|missing|value:asc|desc and 
limit and offset work as in SQL. Faceting will generally use more RAM and be 
faster than the 4.0 baseline. I've changed more than some might consider to be 
strictly necessary; this is because a large part of my effort has been to make 
faceting performant under adverse conditions, with large result sets and 
faceting on fields with large (1m+) cardinalities. If there's interest I can 
post some rough response time numbers for faceting on fields with various 
cardinalities.




[jira] [Created] (SOLR-4214) Hierarchical stats

2012-12-18 Thread Steve McKay (JIRA)
Steve McKay created SOLR-4214:
-

 Summary: Hierarchical stats
 Key: SOLR-4214
 URL: https://issues.apache.org/jira/browse/SOLR-4214
 Project: Solr
  Issue Type: New Feature
  Components: SearchComponents - other
Reporter: Steve McKay


Hierarchical stats faceting, e.g. facet by vendor and then facet each vendor by 
year.
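
To illustrate what "facet by vendor and then facet each vendor by year" could 
look like, here is a small Python sketch. It is not the patch's code or 
response format: the amount field, the nested dict layout, and the function 
names are assumptions made for the example.

```python
def new_node():
    """One stats bucket; 'facets' holds the next level of the hierarchy."""
    return {"count": 0, "sum": 0.0, "min": None, "max": None, "facets": {}}


def update(node, value):
    """Fold one document's value into a bucket's running stats."""
    node["count"] += 1
    node["sum"] += value
    node["min"] = value if node["min"] is None else min(node["min"], value)
    node["max"] = value if node["max"] is None else max(node["max"], value)


def hierarchical_stats(docs, facet_fields, stats_field):
    """Compute stats at the root and at every level of the facet hierarchy."""
    root = new_node()
    for doc in docs:
        value = doc[stats_field]
        node = root
        update(node, value)
        for field in facet_fields:
            node = node["facets"].setdefault(doc[field], new_node())
            update(node, value)
    return root


if __name__ == "__main__":
    docs = [
        {"vendor": "acme", "year": 2011, "amount": 10.0},
        {"vendor": "acme", "year": 2012, "amount": 5.0},
        {"vendor": "bolt", "year": 2012, "amount": 7.0},
    ]
    tree = hierarchical_stats(docs, ["vendor", "year"], "amount")
    print(tree["facets"]["acme"]["sum"])
    print(tree["facets"]["acme"]["facets"][2012])
```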

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-4214) Hierarchical stats

2012-12-18 Thread Steve McKay (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-4214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve McKay updated SOLR-4214:
--

Attachment: stats.patch

The primary new feature is hierarchical faceting, e.g. facet by vendor and then 
facet each vendor by year. I've also added stats.sort, stats.limit, and 
stats.offset field params. stats.sort syntax is 
sum|min|max|stdDev|average|sumOfSquares|count|missing|value:asc|desc and 
limit and offset work as in SQL. Faceting will generally use more RAM and be 
faster than the 4.0 baseline. I've changed more than some might consider to be 
strictly necessary; this is because a large part of my effort has been to make 
faceting performant under adverse conditions, with large result sets and 
faceting on fields with large (1m+) cardinalities.
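
As a reading aid, the following Python sketch shows one way to interpret the 
sort, limit, and offset semantics described above. It assumes the stats.sort 
value has the form key:direction and that limit and offset apply after 
sorting, as in SQL. The parser and the facet layout are illustrative guesses, 
not the code in stats.patch.

```python
SORT_KEYS = {
    "sum", "min", "max", "stdDev", "average",
    "sumOfSquares", "count", "missing", "value",
}


def parse_stats_sort(spec):
    """Split 'key:asc' or 'key:desc' into (key, reverse flag)."""
    key, sep, direction = spec.partition(":")
    if not sep or key not in SORT_KEYS or direction not in ("asc", "desc"):
        raise ValueError("bad stats.sort value: %r" % spec)
    return key, direction == "desc"


def sort_and_page(facets, spec, limit=None, offset=0):
    """Sort facets (facet value -> stats dict), then apply offset and limit."""
    key, reverse = parse_stats_sort(spec)

    def sort_key(item):
        value, stats = item
        # 'value' sorts by the facet value itself, anything else by that stat.
        return value if key == "value" else stats[key]

    ordered = sorted(facets.items(), key=sort_key, reverse=reverse)
    end = None if limit is None else offset + limit
    return ordered[offset:end]


if __name__ == "__main__":
    facets = {
        "acme": {"sum": 5.0, "count": 2},
        "bolt": {"sum": 9.0, "count": 1},
        "core": {"sum": 7.0, "count": 3},
    }
    print(sort_and_page(facets, "sum:desc", limit=2, offset=1))
```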

One caveat: distributed stats are broken in this patch due to other work in 
progress. Tests pass, although I changed a few test cases relating to what 
happens when stats.field is completely absent in the result set. The existing 
behavior is to return null as the stats result and my code returns zeroed-out 
stats, which IMO is more felicitous anyway.

The attached patch is diffed from branches/lucene_solr_4_0.


 Hierarchical stats
 --

 Key: SOLR-4214
 URL: https://issues.apache.org/jira/browse/SOLR-4214
 Project: Solr
  Issue Type: New Feature
  Components: SearchComponents - other
Reporter: Steve McKay
 Attachments: stats.patch


 Hierarchical stats faceting, e.g. facet by vendor and then facet each vendor 
 by year.




[jira] [Created] (LUCENE-4528) Slf4JInfoStream - sends InfoStream messages to SLF4J

2012-11-02 Thread Steve McKay (JIRA)
Steve McKay created LUCENE-4528:
---

 Summary: Slf4JInfoStream - sends InfoStream messages to SLF4J
 Key: LUCENE-4528
 URL: https://issues.apache.org/jira/browse/LUCENE-4528
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/other
Reporter: Steve McKay
Priority: Minor


InfoStream doesn't play well with logging. With Slf4JInfoStream, users can send 
InfoStream messages to the logging library of their choice for processing. 
Hooray!




[jira] [Updated] (LUCENE-4528) Slf4JInfoStream - sends InfoStream messages to SLF4J

2012-11-02 Thread Steve McKay (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve McKay updated LUCENE-4528:


Attachment: LUCENE-4528.patch

 Slf4JInfoStream - sends InfoStream messages to SLF4J
 

 Key: LUCENE-4528
 URL: https://issues.apache.org/jira/browse/LUCENE-4528
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/other
Reporter: Steve McKay
Priority: Minor
 Attachments: LUCENE-4528.patch


 InfoStream doesn't play well with logging. With Slf4JInfoStream, users can 
 send InfoStream messages to the logging library of their choice for 
 processing. Hooray!




[jira] [Commented] (LUCENE-4528) Slf4JInfoStream - sends InfoStream messages to SLF4J

2012-11-02 Thread Steve McKay (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13489792#comment-13489792
 ] 

Steve McKay commented on LUCENE-4528:
-

You're right--I'm not accustomed to thinking about this from upstream's POV. As 
a user I'd get 80% of the benefit from having this available in Solr, which 
already depends on SLF4J, so maybe it would be better for me to resubmit this 
as a config option in Solr? lucene/misc also has no external deps at the moment.




Re: Invalid XML Output (Arabic): int name=0

2012-08-30 Thread Steve McKay
Are you sure this isn't an Eclipse issue? The plaintext (quoted-printable) 
source of the int elements looks like this:

<int name=3D"=D8=A7=D9=84=D9=85=D8=B3=D8=AA=D8=B4=D9=81=D9=89 
=D9=88=D9=82=D8=A7=D9=84=D8=AA =D9=8A=D8=B1=D9=88=D8=AD">0</int>

AFAIK this is valid XML. There are only a few codepoints not allowed by XML 
1.0. I don't see a right-to-left mark, so it seems like Mail.app is trying to 
infer a right-to-left mark for the Arabic text and getting it wrong, causing 
the mangled display.
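
For anyone who wants to check this themselves, here is a short Python sketch 
that decodes the quoted-printable bytes of the first facet name shown above, 
tests every codepoint against the XML 1.0 Char production, and looks for 
explicit directional marks. The function names are my own, and the set of 
marks checked (U+200E and U+200F) is a simplifying assumption.

```python
import quopri

# Quoted-printable bytes of the attribute value from the first int element.
QP_NAME = (
    b"=D8=A7=D9=84=D9=85=D8=B3=D8=AA=D8=B4=D9=81=D9=89 "
    b"=D9=88=D9=82=D8=A7=D9=84=D8=AA =D9=8A=D8=B1=D9=88=D8=AD"
)

# Left-to-right mark and right-to-left mark.
BIDI_MARKS = {0x200E, 0x200F}


def is_xml10_char(cp):
    """True if the codepoint matches the Char production of XML 1.0."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def inspect_name(qp_bytes):
    """Decode quoted-printable UTF-8 and report XML validity and bidi marks."""
    text = quopri.decodestring(qp_bytes).decode("utf-8")
    codepoints = [ord(ch) for ch in text]
    return {
        "text": text,
        "all_valid_xml10": all(is_xml10_char(cp) for cp in codepoints),
        "has_bidi_mark": any(cp in BIDI_MARKS for cp in codepoints),
    }


if __name__ == "__main__":
    info = inspect_name(QP_NAME)
    print(info["all_valid_xml10"], info["has_bidi_mark"])
    print(" ".join("U+%04X" % ord(ch) for ch in info["text"]))
```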

On Aug 29, 2012, at 2:17 PM, Fuad Efendi <f...@efendi.ca> wrote:

> Hi all,
>
> It looks like we have very special command character here… which mirrors
> some visible images… but it is still invalid XML when I try to validate in
> Eclipse…
> Solr-4.0.0-BETA
>
> <?xml version="1.0" encoding="UTF-8"?>
> <response>
>   <lst name="responseHeader">
>     <int name="status">0</int>
>     <int name="QTime">237</int>
>     <lst name="params">
>       <str name="facet">true</str>
>       <str name="facet.offset">1</str>
>       <str name="facet.sort">index</str>
>       <str name="facet.limit">10</str>
>       <str name="facet.field">enrich_keywords_string_mv</str>
>     </lst>
>   </lst>
>   <result name="response" numFound="0" start="0"></result>
>   <lst name="facet_counts">
>     <lst name="facet_queries"/>
>     <lst name="facet_fields">
>       <lst name="enrich_keywords_string_mv">
>         <int name="المستشفى وقالت يروح">0</int>
>         <int name="المستشفى وقالو لي">0</int>
>         <int name="المستشفى وقالوا خلاص">0</int>
>         <int name="المستشفى وقالوا عندك">0</int>
>         <int name="المستشفى وقالوا لا">0</int>
>         <int name="المستشفى وقالوا لابو">0</int>
>         <int name="المستشفى وقالوا لهم">0</int>
>         <int name="المستشفى وقالوا لي">0</int>
>         <int name="المستشفى وقالى تعالى">0</int>
>         <int name="المستشفى وقام بعمل">0</int>
>       </lst>
>     </lst>
>     <lst name="facet_dates"/>
>     <lst name="facet_ranges"/>
>   </lst>
> </response>