Re: Solr 3.6 issue - DataImportHandler with CachedSqlEntityProcessor not importing all multi-valued fields

2012-07-04 Thread Mikhail Khludnev
It's hard to troubleshoot without debug logs. Pls pay attention that the
regular configuration for CachedSqlEntityProcessor is slightly different:

http://wiki.apache.org/solr/DataImportHandler#CachedSqlEntityProcessor
see

  where="xid=x.id"
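
For the foo/bar config quoted below, the cached child entity would then look
roughly like this (a sketch only; the wiki-style where attribute names the
cache-key column returned by the child query, and foo.id refers to the
parent row):

  <entity name="bar"
          processor="CachedSqlEntityProcessor"
          query="SELECT b.id, b.name AS bar_name FROM bar b"
          where="id=foo.id">
    <field column="bar_name" name="bar_name" />
  </entity>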



On Wed, Jun 27, 2012 at 2:29 AM, ps_sra praveens1...@yahoo.com wrote:

 Not sure if this is the right forum to post this question.  If not, please
 excuse.

 I'm trying to use the DataImportHandler with
 processor=CachedSqlEntityProcessor to speed up import from an RDBMS.
 While
 processor=CachedSqlEntityProcessor is much faster than
 processor=SqlEntityProcessor, the resulting Solr index does not contain
 multi-valued fields on sub-entities.

 So, for example, my db-data-config.xml has the following structure:

 <document>
 ..
   <entity name="foo" pk="id"
           processor="SqlEntityProcessor"
           query="SELECT
                    f.id AS foo_id,
                    f.name AS foo_name
                  FROM
                    foo f">
     <field column="foo_id" name="foo_id" />
     <field column="foo_name" name="foo_name" />

     <entity name="bar"
             processor="CachedSqlEntityProcessor"
             query="SELECT b.name AS bar_name
                    FROM bar b
                    WHERE b.id = '${foo.id}'">
       <field column="bar_name" name="bar_name" />
     </entity>
   </entity>
 ..
 </document>

 where the database relationship foo:bar is 1:m.

 The issue is that when I import with processor=SqlEntityProcessor ,
 everything works fine and the multi-valued field - bar_name has multiple
 values, while importing with processor=CachedSqlEntityProcessor does not
 even create the bar_name field in the index.

 I've deployed Solr 3.6 on Weblogic 11g, with the patch
 https://issues.apache.org/jira/browse/SOLR-3360 applied.

 Any help on this issue is appreciated.


 Thanks,
 ps

 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Solr-3-6-issue-DataImportHandler-with-CachedSqlEntityProcessor-not-importing-all-multi-valued-fields-tp3991449.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Multi-thread UpdateProcessor

2012-07-06 Thread Mikhail Khludnev
Okay, why do you think this idea is not worth looking at?

On Fri, Jul 6, 2012 at 12:53 AM, Mikhail Khludnev 
mkhlud...@griddynamics.com wrote:

 Hello,

  Most times when single-thread streaming
  http://wiki.apache.org/solr/Solrj#Streaming_documents_for_an_update is
  used, I saw a lack of CPU utilization on the Solr server. A reasonable
  motivation is to utilize more threads to index faster, but that requires a
  more complicated client side.
  I propose to employ a special update processor which can fork the stream
  processing onto many threads. If you like it, pls vote for
  https://issues.apache.org/jira/browse/SOLR-3585 .

 Regards

 --
 Sincerely yours
 Mikhail Khludnev
 Tech Lead
 Grid Dynamics

 http://www.griddynamics.com
  mkhlud...@griddynamics.com




-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Multi-thread UpdateProcessor

2012-07-08 Thread Mikhail Khludnev
Some benchmarks added; pls check the jira.

On Fri, Jul 6, 2012 at 11:13 PM, Dmitry Kan dmitry@gmail.com wrote:

 Mikhail,

 you have my +1 and a jira comment :)

 // Dmitry

 On Fri, Jul 6, 2012 at 7:41 PM, Mikhail Khludnev 
 mkhlud...@griddynamics.com
  wrote:

  Okay, why do you think this idea is not worth looking at?
 
  On Fri, Jul 6, 2012 at 12:53 AM, Mikhail Khludnev 
  mkhlud...@griddynamics.com wrote:
 
   Hello,
  
    Most times when single-thread streaming
    http://wiki.apache.org/solr/Solrj#Streaming_documents_for_an_update is
    used, I saw a lack of CPU utilization on the Solr server. A reasonable
    motivation is to utilize more threads to index faster, but that requires
    a more complicated client side.
   I propose to employ special update processor which can fork the stream
   processing onto many threads. If you like it pls vote for
   https://issues.apache.org/jira/browse/SOLR-3585 .
  
   Regards
  
   --
   Sincerely yours
   Mikhail Khludnev
   Tech Lead
   Grid Dynamics
  
   http://www.griddynamics.com
mkhlud...@griddynamics.com
  
  
 
 
  --
  Sincerely yours
  Mikhail Khludnev
  Tech Lead
  Grid Dynamics
 
  http://www.griddynamics.com
   mkhlud...@griddynamics.com
 



 --
 Regards,

 Dmitry Kan




-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Searching for sentences containing a list of words with a configurable number of words not in the list inbetween?

2012-07-10 Thread Mikhail Khludnev
Welcome!

A few points:
- did you choose the right mailing list? (let me reply to the other one)
- have you checked
http://lucene.apache.org/core/3_6_0/queryparsersyntax.html#Proximity%20Searches ?
- the same in the Lucene Queries API is
http://lucene.apache.org/core/3_6_0/api/core/org/apache/lucene/search/PhraseQuery.html and
http://lucene.apache.org/core/3_6_0/api/core/org/apache/lucene/search/spans/SpanNearQuery.html
- it seems to me you should familiarize yourself with explain output soon:
http://wiki.apache.org/solr/SolrRelevancyFAQ#Why_does_id:archangel_come_before_id:hawkgirl_when_querying_for_.22wings.22
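
For the "any order, up to n extra words in between" requirement described
below, a SpanNearQuery sketch (Lucene 3.6 API; the field name "body" and the
terms are illustrative):

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.spans.SpanNearQuery;
  import org.apache.lucene.search.spans.SpanQuery;
  import org.apache.lucene.search.spans.SpanTermQuery;

  // match "house" and "together" in any order (inOrder=false),
  // with at most one other term between them (slop=1)
  SpanQuery[] clauses = new SpanQuery[] {
      new SpanTermQuery(new Term("body", "house")),
      new SpanTermQuery(new Term("body", "together"))
  };
  SpanNearQuery query = new SpanNearQuery(clauses, 1, false);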

Regards

On Mon, Jul 9, 2012 at 10:28 PM, Svetlana mailingli...@dswp.co.uk wrote:

 Hi,

 I am just about to work through the demo and get to know Lucene, now that I
 actually got it to build :)  I was wondering if someone could point me in
 the right direction for my project.

 I want to query using a list of words, but the order that they appear in and
 how common they are is not relevant (i.e. no 'stop words', if I got that
 terminology correct).  The only relevant thing is how closely grouped they
 are and how many of the words in the list occur, and I want to be able to
 configure from 0 (no other non-queried words in between) up to 'n'
 non-queried words in between.

 So for example, if I query for 'a and in house I go together or' (stupid
 example I guess) and specify 0 words inbetween then I would only want to
 get
 hits with those query words in any order, sorted by relevance based on how
 many of those words occurred.  For example:

 'In a house together' may be the most relevant result

 If I specify 1 other none query word allowed, results may look like

 1. 'In a house together.'
 2. 'In a house sleeping together.'  ('sleeping' being the one extra word
 allowed)

 These should also be complete sentences or clauses, i.e. not 'fragments' -
 I
 guess I need to use a grammar analyser to determine that.

 Any help very much appreciated, I realise that this is probably deceptively
 difficult but if anyone can give some pointers that would be amazing.

 Svetlana

 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Searching-for-sentences-containing-a-list-of-words-with-a-configurable-number-of-words-not-in-the-li-tp3993981.html
 Sent from the Lucene - Java Developer mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org




-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Query for records that have more than N values in a multi-valued field

2012-07-23 Thread Mikhail Khludnev
Hello Alexandre,

Some time ago I wanted to contribute it
http://mail-archives.apache.org/mod_mbox/lucene-dev/201203.mbox/%3ccangii8dukawp7mt1xqrjb5axdqptm5r4z+yzplfc7ptywsq...@mail.gmail.com%3E
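
If patching isn't an option, a common workaround is to count the values into
a separate integer field at index time (a hypothetical valueCount field
populated by your indexing code) and then filter with fq=valueCount:[3 TO *].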


On Mon, Jul 23, 2012 at 7:05 PM, Alexandre Rafalovitch
arafa...@gmail.com wrote:

 Hello,

 I have a multivalued field and I want to find records that have (for
 example) at least 3 values in that list. Is there an easy way to do
 it?

 Regards,
Alex.
 Personal blog: http://blog.outerthoughts.com/
 LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
 - Time is the quality of nature that keeps events from happening all
 at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
 book)




-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Bulk indexing data into solr

2012-07-26 Thread Mikhail Khludnev
Right on time, guys. https://issues.apache.org/jira/browse/SOLR-3585

Here is a server-side update processing fork. It does its best to halt
processing when an exception occurs. Plug in this UpdateProcessor and specify
the number of threads, then submit a lazy iterator into
StreamingUpdateSolrServer on the client side.

PS: Don't do the following: send many, many docs one-by-one, or instantiate a
huge ArrayList of SolrInputDocument on the client side.

On Thu, Jul 26, 2012 at 7:46 PM, Shawn Heisey s...@elyograg.org wrote:

 On 7/26/2012 7:34 AM, Rafał Kuć wrote:

 If you use Java (and I think you do, because you mention Lucene) you
 should take a look at StreamingUpdateSolrServer. It not only allows
 you to send data in batches, but also index using multiple threads.


 A caveat to what Rafał said:

 The streaming object has no error detection out of the box.  It queues
 everything up internally and returns immediately.  Behind the scenes, it
 uses multiple threads to send documents to Solr, but any errors encountered
 are simply sent to the logging mechanism, then ignored.  When you use
 HttpSolrServer, all errors encountered will throw exceptions, but you have
 to wait for completion.  If you need both concurrent capability and error
 detection, you would have to manage multiple indexing threads yourself.

 Apparently there is a method in the concurrent class that you can override
 and handle errors differently, though I have not seen how to write code so
 your program would know that an error occurred.  I filed an issue with a
 patch to solve this, but some of the developers have come up with an idea
 that might be better.  None of the ideas have been committed to the project.

  https://issues.apache.org/jira/browse/SOLR-3284

 Just an FYI, the streaming class was renamed to ConcurrentUpdateSolrServer
 in Solr 4.0 Alpha.  Both are available in 3.6.x.

 Thanks,
 Shawn




-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Bulk indexing data into solr

2012-07-26 Thread Mikhail Khludnev
Coming back to your original question, I'm puzzled a little.
It's not clear where you want to call the Lucene API directly from.
If you mean that you have a standalone indexer which writes index files, then
stops, and these files become available to the Solr process, it will work.
Sharing an index between processes, or using EmbeddedSolrServer, is looking
for problems (despite Lucene having a lock mechanism, which I'm not
completely aware of).
I conclude that your data for indexing is collocated with the Solr server. In
this case consider
http://wiki.apache.org/solr/ContentStream#RemoteStreaming

Please give more details about your design.

On Thu, Jul 26, 2012 at 1:22 PM, Zhang, Lisheng 
lisheng.zh...@broadvision.com wrote:


 Hi,

 I am starting to use Solr, and now I need to index a rather large amount of
 data. It seems that calling Solr to pass data through HTTP is rather
 inefficient, so I am thinking to still call the Lucene API directly for bulk
 indexing but to use Solr for search. Is this design OK?

 Thanks very much for helps, Lisheng




-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Bulk indexing data into solr

2012-07-26 Thread Mikhail Khludnev
IIRC, about two months ago a problem with such a scheme was discussed here,
but I can't remember the exact details.
The scheme is generally correct, but you didn't tell how you let Solr know
that it needs to reread the new index generation after the indexer fsyncs the
segments file.

btw, it might be a possible issue:
https://lucene.apache.org/core/old_versioned_docs//versions/3_0_1/api/all/org/apache/lucene/index/IndexWriter.html#commit()
 Note that this operation calls Directory.sync on the index files. That
call should not return until the file contents & metadata are on stable
storage. For FSDirectory, this calls the OS's fsync. But, beware: some
hardware devices may in fact cache writes even during fsync, and return
before the bits are actually on stable storage, to give the appearance of
faster performance.

You should ensure that after segments.gen is fsync'ed, all other index
files are fsynced for other processes too.
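
One common way to do that handoff (an assumption about your setup, not
something stated in this thread) is: after the indexer commits and closes its
IndexWriter, issue an empty commit to Solr so it reopens its searcher on the
new index generation, e.g.

  curl 'http://localhost:8983/solr/update?commit=true'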

Could you tell more about your data:
What's the format?
Is it collocated with the indexer?
And why can't you use remote streaming via Solr's update handler, or an
indexer client app with StreamingUpdateSolrServer?

On Thu, Jul 26, 2012 at 10:47 PM, Zhang, Lisheng 
lisheng.zh...@broadvision.com wrote:

 Hi,

 I think that at least before Lucene 4.0 we can only allow one process/thread
 to write to a Lucene folder. Based on this fact my initial plan is:

 1) There is one set of Lucene index folders.
 2) The Solr server only performs queries on those folders.
 3) Have a separate process (multi-threaded) to index those Lucene folders
 (each folder is a separate app). Only one thread will index one given
 Lucene folder.

 Thanks very much for helps, Lisheng


 -Original Message-
 From: Mikhail Khludnev [mailto:mkhlud...@griddynamics.com]
 Sent: Thursday, July 26, 2012 10:15 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Bulk indexing data into solr


 Coming back to your original question, I'm puzzled a little.
 It's not clear where you want to call the Lucene API directly from.
 If you mean that you have a standalone indexer which writes index files,
 then stops, and these files become available to the Solr process, it will
 work. Sharing an index between processes, or using EmbeddedSolrServer, is
 looking for problems (despite Lucene having a lock mechanism, which I'm not
 completely aware of).
 I conclude that your data for indexing is collocated with the Solr server.
 In this case consider
 http://wiki.apache.org/solr/ContentStream#RemoteStreaming

 Please give more details about your design.

 On Thu, Jul 26, 2012 at 1:22 PM, Zhang, Lisheng 
 lisheng.zh...@broadvision.com wrote:

 
  Hi,
 
   I am starting to use Solr, and now I need to index a rather large amount
   of data. It seems that calling Solr to pass data through HTTP is rather
   inefficient, so I am thinking to still call the Lucene API directly for
   bulk indexing but to use Solr for search. Is this design OK?
 
  Thanks very much for helps, Lisheng
 
 


 --
 Sincerely yours
 Mikhail Khludnev
 Tech Lead
 Grid Dynamics

 http://www.griddynamics.com
  mkhlud...@griddynamics.com




-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Bulk Indexing

2012-07-28 Thread Mikhail Khludnev
Lan,

I assume that some particular server can freeze on such a bulk, but the
overall message doesn't seem absolutely correct to me. Solr has a lot of
mechanisms to survive in such cases.
Bulk indexing is absolutely right (if you submit a single request with a long
iterator of SolrInputDocs). This indexing thread can occupy a single CPU
core, keeping the others ready for searches. Such indexing occupies
ramBufferSizeMB of heap. After the limit is exceeded a new segment is flushed
to disk, which requires some IO and can impact searchers (a misconfigured
merge can ruin everything, of course).
Commits should be executed for business considerations, not performance ones.
A commit leads to creating a new searcher and warming it; these actions can
be memory- and CPU-expensive (almost single-threaded activity).
I did some experiments on a 40M-doc index on a desktop box. Constantly adding
1K docs/sec with autocommit more than once per minute doesn't have a
significant impact on search latency.
Generally, yes, a master-slave scheme has more performance, for sure.

On Sat, Jul 28, 2012 at 4:01 AM, Lan dung@gmail.com wrote:

 I assume you're indexing on the same server that is used to execute search
 queries. Adding 20K documents in bulk could cause the Solr server to 'stop
 the world', where the server would stop responding to queries.

 My suggestion is
 - Setup master/slave to insulate your clients from 'stop the world' events
 during indexing.
 - Update in batches with a commit at the end of the batch.



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Bulk-Indexing-tp3997745p3997815.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Expression Sort in Solr

2012-07-31 Thread Mikhail Khludnev
Hello,

have you tried http://wiki.apache.org/solr/FunctionQuery/#if ?
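
For example, the IF(PRICE>1000,70,40) piece from the question below can be
expressed with map() in 3.x-era function-query syntax (a sketch;
map(x,min,max,target,default) maps PRICE values in [1000, 999999999] to 70
and everything else to 40; if() on that wiki page is marked as Solr 4.0):

  sort=map(PRICE,1000,999999999,70,40) desc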

On Mon, Jul 30, 2012 at 3:05 PM, lavesh lavesh.ra...@gmail.com wrote:

 I am working on solr for search. I need to perform an expression sort such
 that:

 say str = ((IF AVAILABLE IN (1,2,3),100,IF(AVAILABLE IN (4,5,6),80,100)) +
 IF(PRICE>1000,70,40))

 need to order by (if(str>100,40+str/40,33+str/33)+SOMEOTHERCOLUMN) DESC





 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Expression-Sort-in-Solr-tp3998050.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Bulk Indexing

2012-07-31 Thread Mikhail Khludnev
Usually collecting the whole array hurts the client's JVM, and sending
doc-by-doc bloats the server with a huge number of small requests. You just
need to rewrite your code from the eager loop to a pulling iterator, to be
able to submit all docs via a single HTTP request:
http://wiki.apache.org/solr/Solrj#Streaming_documents_for_an_update
Then, if you aren't happy with low utilization due to using a single thread,
post your problem and numbers here again.
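
A minimal sketch of that pulling-iterator idiom (SolrJ 3.6-era API; Row and
rowSource() are hypothetical stand-ins for your data-access code):

  import java.util.Iterator;
  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  SolrServer server = new HttpSolrServer("http://localhost:8983/solr");
  final Iterator<Row> rows = rowSource(); // lazy cursor over the source data
  server.add(new Iterator<SolrInputDocument>() {
    public boolean hasNext() { return rows.hasNext(); }
    public SolrInputDocument next() {   // docs are built one at a time, on demand
      Row r = rows.next();
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", r.id);
      doc.addField("name", r.name);
      return doc;
    }
    public void remove() { throw new UnsupportedOperationException(); }
  });
  server.commit();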

http://wiki.apache.org/solr/SolrReplication
http://lucidworks.lucidimagination.com/display/solr/Index+Replication

On Sat, Jul 28, 2012 at 11:21 PM, Sohail Aboobaker sabooba...@gmail.com wrote:

 We have auto commit on and basically send documents in a loop: after
 validating each record, we send it to the search service, and keep doing it
 in a loop. Mikhail / Lan, are you suggesting that instead of sending them in
 a loop, we should collect them in an array and do a commit at the end? Is
 this better than doing it in a loop with auto commit?

 Also, where can I find some reference on Master / Slave configuration.

 Thanks.




-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Expression Sort in Solr

2012-07-31 Thread Mikhail Khludnev
how exactly?

On Tue, Jul 31, 2012 at 1:19 PM, lavesh lavesh.ra...@gmail.com wrote:

 Yes I have; it's not working as per the need



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Expression-Sort-in-Solr-tp3998050p3998310.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Map Complex Datastructure with Solr

2012-08-02 Thread Mikhail Khludnev
 the possibility to create own FieldTypes, but I don't know
  if this is the answer of my issues...
 
  2012/8/1 Jack Krupansky j...@basetechnology.com:
  The general rule is to flatten the structures. You have a choice
 between
  sharing common fields between tables, such as title, or adding a
  prefix/suffix to qualify them, such as document_title vs.
 product_title.
 
  You also have the choice of storing different tables in separate Solr
  cores/collections, but then you have the burden of querying them
 separately
  and coordinating the separate results on your own. It all depends on
 your
  application.
 
  A lot hinges on:
 
  1. How do you want to search the data?
  2. How do you want to access the fields once the Solr documents have
 been
  identified by a query - such as fields to retrieve, join, etc.
 
  So, once the data is indexed, what are your requirements for
 accessing the
  data? E.g., some sample pseudo-queries and the fields you want to
 access.
 
  -- Jack Krupansky
 
  -Original Message- From: Thomas Gravel
  Sent: Wednesday, August 01, 2012 9:52 AM
  To: solr-user@lucene.apache.org
  Subject: Map Complex Datastructure with Solr
 
 
  Hi,
  how can I map these complex Datastructure in Solr?
 
  Document
 - Groups
  - Group_ID
  - Group_Name
  - .
- Title
- Chapter
  - Chapter_Title
  - Chapter_Content
 
 
  Or
 
  Product
 - Groups
  - Group_ID
  - Group_Name
  - .
- Title
- Articles
  - Article_ID
  - Article_Color
  - Article_Size
 
  Thanks for ideas




-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Solr 4.0 - Join performance

2012-08-02 Thread Mikhail Khludnev
Hello,

You can check my record.
https://issues.apache.org/jira/browse/SOLR-3076?focusedCommentId=13415644page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13415644

I'm still working on precise performance measurement.

On Thu, Aug 2, 2012 at 6:45 PM, Eric Khoury ekhour...@hotmail.com wrote:







 Hello all,



 I’m testing out the new join feature, hitting some perf
 issues, as described in Erick’s article (
 http://architects.dzone.com/articles/solr-experimenting-join).

 Basically, I’m using 2 objects in solr (this is a simplified
 view):



 Item

 - Id

 - Name



 Grant

 - ItemId

 - AvailabilityStartTime

 - AvailabilityEndTime



 Each item can have multiple grants attached to it.



 The query I'm using is the following, to find items by
 name, filtered by grants availability window:



 solr/select?fq=Name:XXX&q={!join
 from=ItemId to=Id}AvailabilityStartTime:[* TO NOW] AND
 -AvailabilityEndTime:[* TO NOW]



 With a hundred thousand items, this query can take multiple seconds
 to perform, due to the large number of ItemIds returned from the join
 query.

 Has anyone come up with a better way to use joins for these types of
 queries?  Are there improvements planned in 4.0 rtm in this area?



 Btw, I’ve explored simply adding Start-End times to items, but
 the flat data model makes it hard to maintain start-end pairs.



 Thanks for the help!

 Eric.








-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Solr 4.0 - Join performance

2012-08-02 Thread Mikhail Khludnev
Eric,

you can take the last patch from SOLR-3076:
https://issues.apache.org/jira/secure/attachment/12536717/SOLR-3076.patch
(SOLR-3076.patch, 16/Jul/12 21:16)

You can also take it applied from
https://github.com/m-khl/solr-patches/tree/6611 , but the origin source
code might be a little bit old.
Regarding a nightly build, it's not so optimistic: I can't attract a
committer to review it.

On Thu, Aug 2, 2012 at 11:51 PM, Eric Khoury ekhour...@hotmail.com wrote:

  Wow, great work Mikhail, that's impressive.
 I haven't currently built the dev tree; you wouldn't have a patch for
 the alpha build handy?
 If not, when do you think this'll be available in a nightly build?
 Thanks again,
 Eric.
  From: mkhlud...@griddynamics.com
  Date: Thu, 2 Aug 2012 22:38:13 +0400
  Subject: Re: Solr 4.0 - Join performance
  To: solr-user@lucene.apache.org

 
  Hello,
 
  You can check my record.
 
 https://issues.apache.org/jira/browse/SOLR-3076?focusedCommentId=13415644page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13415644
 
  I'm still working on precise performance measurement.
 
  On Thu, Aug 2, 2012 at 6:45 PM, Eric Khoury ekhour...@hotmail.com
 wrote:
 
  
  
  
  
  
  
   Hello all,
  
  
  
   I’m testing out the new join feature, hitting some perf
   issues, as described in Erick’s article (
   http://architects.dzone.com/articles/solr-experimenting-join).
  
   Basically, I’m using 2 objects in solr (this is a simplified
   view):
  
  
  
   Item
  
   - Id
  
   - Name
  
  
  
   Grant
  
   - ItemId
  
   - AvailabilityStartTime
  
   - AvailabilityEndTime
  
  
  
   Each item can have multiple grants attached to it.
  
  
  
   The query I'm using is the following, to find items by
   name, filtered by grants availability window:
  
  
  
    solr/select?fq=Name:XXX&q={!join
    from=ItemId to=Id}AvailabilityStartTime:[* TO NOW] AND
    -AvailabilityEndTime:[* TO NOW]
  
  
  
   With a hundred thousand items, this query can take multiple seconds
   to perform, due to the large number or ItemIds returned from the join
   query.
  
   Has anyone come up with a better way to use joins for these types of
   queries? Are there improvements planned in 4.0 rtm in this area?
  
  
  
   Btw, I’ve explored simply adding Start-End times to items, but
   the flat data model makes it hard to maintain start-end pairs.
  
  
  
   Thanks for the help!
  
   Eric.
  
  
  
  
 
 
 
 
  --
  Sincerely yours
  Mikhail Khludnev
  Tech Lead
  Grid Dynamics
 
  http://www.griddynamics.com
  mkhlud...@griddynamics.com




-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: search hit on multivalued fields

2012-08-03 Thread Mikhail Khludnev
Mark,

It's not clear what you want to do. Let's say you requested rows=100
and found 1000 docs. What do you need to show in addition to the search result?
- the matched field on every one of the 100 snippets,
- or 400 matches with F1 and 600 with F2,
- or something else?

On Fri, Aug 3, 2012 at 6:41 PM, Jack Krupansky j...@basetechnology.com wrote:

 You can include the fields in your fl list and then check those field
 values explicitly in the client, or you could add debugQuery=true to your
 request and check for which field the term matched in. The latter requires
 that you have the analyzed term (or check for closest matching term).

 -- Jack Krupansky

 -Original Message- From: Mark , N
 Sent: Friday, August 03, 2012 5:51 AM
 To: solr-user@lucene.apache.org
 Subject: search hit on multivalued fields


 I have a multivalued field Text which is indexed, for example:

 F1: some value
 F2: some value
 Text = (content of F1, F2)

 When users search, I am checking only the Text field, but I would also need
 to display to users which field (F1 or F2) resulted in the search hit.
 Is that possible in SOLR?


 --
 Thanks,

 *Nipen Mark *




-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Thread Blocking - Apache Solr 3.6.1

2012-08-05 Thread Mikhail Khludnev
)
 at org.mortbay.jetty.Server.handle(Server.java:326)
 at
 org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
 at

 org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:945)
 at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:756)
 at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:218)
 at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
 at

 org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
 at

 org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
Locked ownable synchronizers:
 - None
 1854553074@qtp-924653460-15 - Thread t@69
java.lang.Thread.State: BLOCKED
 at java.util.logging.StreamHandler.publish(Unknown Source)
 - waiting to lock 23efc88b (a java.util.logging.ConsoleHandler)
 owned by 1462043760@qtp-924653460-20 t@77
 at java.util.logging.ConsoleHandler.publish(Unknown Source)
 at java.util.logging.Logger.log(Unknown Source)
 at
 org.slf4j.impl.JDK14LoggerAdapter.log(JDK14LoggerAdapter.java:588)
 at
 org.slf4j.impl.JDK14LoggerAdapter.info(JDK14LoggerAdapter.java:285)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:1378)
 at

 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
 at

 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
 at

 org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
 at
 org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
 at
 org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
 at
 org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
 at
 org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
 at
 org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
 at

 org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
 at

 org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
 at
 org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
 at org.mortbay.jetty.Server.handle(Server.java:326)
 at
 org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
 at

 org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:945)
 at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:756)
 at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:218)
 at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
 at

 org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
 at

 org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
Locked ownable synchronizers:
 - None
 440688079@qtp-924653460-8 - Acceptor1 SocketConnector@0.0.0.0:8983 -
 Thread t@26
java.lang.Thread.State: BLOCKED
 at java.net.PlainSocketImpl.accept(Unknown Source)
 - waiting to lock 5b5bd00c (a java.net.SocksSocketImpl) owned by
 370915326@qtp-924653460-9 - Acceptor0 SocketConnector@0.0.0.0:8983 t@27
 at java.net.ServerSocket.implAccept(Unknown Source)
 at java.net.ServerSocket.accept(Unknown Source)
 at
 org.mortbay.jetty.bio.SocketConnector.accept(SocketConnector.java:99)
 at

 org.mortbay.jetty.AbstractConnector$Acceptor.run(AbstractConnector.java:708)
 at

 org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
Locked ownable synchronizers:
 - None
 1422284074@qtp-924653460-7 - Acceptor2 SocketConnector@0.0.0.0:8983 -
 Thread t@25
java.lang.Thread.State: BLOCKED
 at java.net.PlainSocketImpl.accept(Unknown Source)
 - waiting to lock 5b5bd00c (a java.net.SocksSocketImpl) owned by
 370915326@qtp-924653460-9 - Acceptor0 SocketConnector@0.0.0.0:8983 t@27
 at java.net.ServerSocket.implAccept(Unknown Source)
 at java.net.ServerSocket.accept(Unknown Source)
 at
 org.mortbay.jetty.bio.SocketConnector.accept(SocketConnector.java:99)
 at

 org.mortbay.jetty.AbstractConnector$Acceptor.run(AbstractConnector.java:708)
 at

 org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
Locked ownable synchronizers:
 - None



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Thread-Blocking-Apache-Solr-3-6-1-tp3999191.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Is this too much time for full Data Import?

2012-08-08 Thread Mikhail Khludnev
Hello,

Does your indexer utilize CPU/IO? Check it with iostat/vmstat.
If it doesn't, take several thread dumps with the jvisualvm sampler or
jstack, and try to understand what blocks your threads from progressing.
It might happen that you need to speed up your SQL data consumption; to do
this, you can enable threads in DIH (only in 3.6.1) and move from N+1 SQL
queries to a select-all/cache approach:
http://wiki.apache.org/solr/DataImportHandler#CachedSqlEntityProcessor and
https://issues.apache.org/jira/browse/SOLR-2382

Good luck

On Wed, Aug 8, 2012 at 9:16 AM, Pranav Prakash pra...@gmail.com wrote:

 Folks,

 My full data import takes ~80hrs. It has around ~9M documents and ~15 SQL
 queries for each document. The database servers are different from the Solr
 servers. Each document has an update processor chain which (a) calculates a
 signature of the document using SignatureUpdateProcessorFactory and (b)
 finds terms which have term frequency > 2, using a custom processor.
 The index size is ~480GiB.

 I want to know if the amount of time taken is too large compared to the
 document count? How do I benchmark the stats and what are some of the ways
 I can improve this? I believe there are some optimizations that I could do
 at Update Processor Factory level as well. What would be a good way to get
 dirty on this?

 *Pranav Prakash*

 temet nosce




-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Does Solr support 'Value Search'?

2012-08-08 Thread Mikhail Khludnev
Hello,

Have you checked
http://lucidworks.lucidimagination.com/display/lweug/Wildcard+Queries ?

On Wed, Aug 8, 2012 at 12:56 AM, Bing Hua bh...@cornell.edu wrote:

 Hi folks,

 Just wondering if there is a query handler that simply takes a query string
 and searches all/part of the fields for field values?

 e.g.
 q=*admin*

 Response may look like
 author: [admin, system_admin, sub_admin]
 last_modifier: [admin, system_admin, sub_admin]
 doctitle: [AdminGuide, AdminManual]



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Does-Solr-support-Value-Search-tp3999654.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Does Solr support 'Value Search'?

2012-08-08 Thread Mikhail Khludnev
Ok. It seems to me you can configure
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory
at index time to produce the term admin from all your docs above; after that
you'll be able to match with a simple term query.
Is that what you are looking for?

On Wed, Aug 8, 2012 at 6:43 PM, Bing Hua bh...@cornell.edu wrote:

 Thanks for the response but wait... Is it related to my question searching
 for field values? I was not asking how to use wildcards though.



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Does-Solr-support-Value-Search-tp3999654p3999817.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Does Solr support 'Value Search'?

2012-08-09 Thread Mikhail Khludnev
Ok, this explanation is much clearer. Have you tried to invoke
http://wiki.apache.org/solr/TermsComponent/ against all the fields which you
need?
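
For example (assuming the /terms handler from the example solrconfig; field
names taken from the earlier messages):

  /solr/terms?terms.fl=author&terms.fl=doctitle&terms.prefix=admin

The response lists matching indexed terms per requested field, so it also
tells you which field every suggestion came from.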

On Wed, Aug 8, 2012 at 10:56 PM, Bing Hua bh...@cornell.edu wrote:

 I don't quite understand, but I'll explain the problem I had. The response
 would contain only fields and a list of field values that match the query.
 Essentially it's querying for field values rather than documents. The
 underlying use case would be: when typing in a quick search box, the
 drill-down menu may contain matches on authors, on doctitles, and
 potentially on other fields.

 Still thanks for your response and hopefully I'm making it clearer.
 Bing



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Does-Solr-support-Value-Search-tp3999654p327.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Does Solr support 'Value Search'?

2012-08-09 Thread Mikhail Khludnev
Sure. Lucene is a kind of column-oriented DB: if the same text occurs in two
different fields there is no relation between such terms, i.e. BRAND:RED
vs COLOR:RED. The only thing I can suggest is to build a separate index (in a
solr core) with docs like token:RED; fields:{COLOR, BRAND,...} or, giving
your initial sample:
{token:admin; field:author; original_text:system_admin}
{token:admin; field:author; original_text:admin}
{token:admin; field:doctitle; original_text:AdminGuide}
...
then you can search by token:admin and find the documents with such occurrences.

On Thu, Aug 9, 2012 at 10:50 PM, Bing Hua bh...@cornell.edu wrote:

 Thanks Kuli and Mikhail,

 Using either TermsComponent or the suggester I could get some suggested
 terms, but it's still confusing how to get the respective field names. In
 order to get that with TermsComponent, I'll need to do a terms query against
 every possible field, and similar with SpellCheckComponent. CopyField won't
 help, since I want the original field name.

 Any suggestions?
 Bing



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Does-Solr-support-Value-Search-tp3999654p4000267.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Solr 4.0 - Join performance

2012-08-14 Thread Mikhail Khludnev
Eric,

Unfortunately the Solr guys ignore it.

On Tue, Aug 14, 2012 at 7:48 PM, Eric Khoury ekhour...@hotmail.com wrote:


 Hi Mikhail, I was trying to figure out if SOLR-3076 made it into the beta,
 but since the issue is still marked as open, I take it it didn't yet?
 Thanks,
 Eric.
   From: mkhlud...@griddynamics.com
  Date: Fri, 3 Aug 2012 00:06:36 +0400
  Subject: Re: Solr 4.0 - Join performance
  To: ekhour...@hotmail.com; solr-user@lucene.apache.org
 
  Eric,
 
   you can take the last patch from SOLR-3076:
   https://issues.apache.org/jira/secure/attachment/12536717/SOLR-3076.patch
   (SOLR-3076.patch, 16/Jul/12 21:16)
 
   You can also take it applied from
   https://github.com/m-khl/solr-patches/tree/6611 , but the origin source
   code might be a little bit old.
   Regarding a nightly build, it's not so optimistic: I can't attract a
   committer to review it.
 
  On Thu, Aug 2, 2012 at 11:51 PM, Eric Khoury ekhour...@hotmail.com
 wrote:
 
Wow, great work Mikhail, that's impressive.
   I don't currently have build the dev tree, you wouldn't have a patch
 for
   the alpha build handy?
   If not, when do you think this'll be available in a nightly build?
   Thanks again,
   Eric.
From: mkhlud...@griddynamics.com
Date: Thu, 2 Aug 2012 22:38:13 +0400
Subject: Re: Solr 4.0 - Join performance
To: solr-user@lucene.apache.org
  
   
Hello,
   
You can check my record.
   
  
 https://issues.apache.org/jira/browse/SOLR-3076?focusedCommentId=13415644page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13415644
   
I'm still working on precise performance measurement.
   
On Thu, Aug 2, 2012 at 6:45 PM, Eric Khoury ekhour...@hotmail.com
   wrote:
   






 Hello all,



 I’m testing out the new join feature, hitting some perf
 issues, as described in Erick’s article (
 http://architects.dzone.com/articles/solr-experimenting-join).

 Basically, I’m using 2 objects in solr (this is a simplified
 view):



 Item

 - Id

 - Name



 Grant

 - ItemId

 - AvailabilityStartTime

 - AvailabilityEndTime



 Each item can have multiple grants attached to it.



 The query I'm using is the following, to find items by
 name, filtered by grants availability window:



  solr/select?fq=Name:XXX&q={!join
  from=ItemId to=Id}AvailabilityStartTime:[* TO NOW] AND
  -AvailabilityEndTime:[* TO NOW]



 With a hundred thousand items, this query can take multiple seconds
 to perform, due to the large number or ItemIds returned from the
 join
 query.

 Has anyone come up with a better way to use joins for these types
 of
 queries? Are there improvements planned in 4.0 rtm in this area?



 Btw, I’ve explored simply adding Start-End times to items, but
 the flat data model makes it hard to maintain start-end pairs.



 Thanks for the help!

 Eric.




   
   
   
   
--
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics
   
http://www.griddynamics.com
mkhlud...@griddynamics.com
  
 
 
 
  --
  Sincerely yours
  Mikhail Khludnev
  Tech Lead
  Grid Dynamics
 
  http://www.griddynamics.com
   mkhlud...@griddynamics.com





-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Diversifying Search Results - Custom Collector

2012-08-20 Thread Mikhail Khludnev
Hello,

I've got the problem description below. Can you explain the expected user
experience, and/or solution approach before diving into the algorithm
design?

Thanks

On Sat, Aug 18, 2012 at 2:50 AM, Karthick Duraisamy Soundararaj 
karthick.soundara...@gmail.com wrote:

 My problem is that when there are a lot of documents representing products,
 products from the same manufacturer seem to appear in close proximity in the
 results, and therefore it doesn't provide brand diversity. When you search
 for sofas, you get sofas from manufacturer A dominating the first page
 while sofas from manufacturer B dominate the second page, etc. The
 issue here is that a manufacturer tends to describe the different sofas he
 produces the same way, and therefore there is very little difference
 between the documents representing two sofas.




-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Diversifying Search Results - Custom Collector

2012-08-20 Thread Mikhail Khludnev
Hello,

I don't believe your task can be solved by playing with scoring/collectors
or shuffling.
For me it's absolutely a Grouping use case (despite the fact that I don't
really know this feature well).

 Grouping cannot solve the problem because I don't want to limit the number
of results shown based on the grouping field.

I'm not really getting it. Why can't you set the limit to 11 and just show
labels like [+] show 6 results.. or, if you have 11, [+] show more than 10
..

If you experience a problem with constructing the search result page, I can
suggest submitting a search request with rows=0&facet.field=BRAND; then your
algorithm can choose the number of necessary items per brand and submit
rows=X&fq=BRAND:Y, which gives you arbitrary sizes for the groups.
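
For example (a hypothetical two-pass request flow):

  /select?q=sofas&rows=0&facet=true&facet.field=BRAND
  /select?q=sofas&rows=3&fq=BRAND:"Maggi"

i.e. the first request only fetches per-brand counts, then one small request
per brand fetches as many items as your page layout wants.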

Will this work for you?

On Mon, Aug 20, 2012 at 8:28 PM, Karthick Duraisamy Soundararaj 
d.s.karth...@gmail.com wrote:

 Tanguy,
   Your idea is perfect for cases where there are too many
 documents, with 80-90% of documents having the same value for a particular
 field. As an example, your idea is ideal for, let's say, 10 documents in
 total like this:

  doc1 : <merchantName>Kellog's</merchantName>
  doc2 : <merchantName>Kellog's</merchantName>
  doc3 : <merchantName>Kellog's</merchantName>
  doc4 : <merchantName>Kellog's</merchantName>
  doc5 : <merchantName>Kellog's</merchantName>
  doc6 : <merchantName>Kellog's</merchantName>
  doc7 : <merchantName>Kellog's</merchantName>
  doc8 : <merchantName>Nestle</merchantName>
  doc9 : <merchantName>Kellog's</merchantName>
  doc10 : <merchantName>Kellog's</merchantName>

 But my data looks more like this:

  doc1 : <merchantName>Maggi</merchantName>
  doc2 : <merchantName>Maggi</merchantName>
  doc3 : <merchantName>M&M's</merchantName>
  doc4 : <merchantName>M&M's</merchantName>
  doc5 : <merchantName>Hershey's</merchantName>
  doc6 : <merchantName>Hershey's</merchantName>
  doc7 : <merchantName>Nestle</merchantName>
  doc8 : <merchantName>Nestle</merchantName>
  doc9 : <merchantName>Kellog's</merchantName>
  doc10 : <merchantName>Kellog's</merchantName>


 Thanks,
 Karthick

 On Mon, Aug 20, 2012 at 12:01 PM, Tanguy Moal tanguy.m...@gmail.com wrote:

 Hello,

 I don't know if that could help, but if I understood your issue, you have
 a lot of documents with the same or very close scores. Moreover I think you
 get your matches in Merchant order (more or less) because they must be
 indexed in that very same order, so solr returns documents of same scores
 in insertion order (although there is no contract specifying this)

 You could work around that issue by :
 1/ Turning off tf/idf because you're searching in documents with little
 text where only the match counts, but frequencies obviously aren't helping.
 2/ Add a random number to each document at index time, and boost on that
 random value at query time, this will shuffle your results, that's probably
 the simplest thing to do.

 Hope this helps,

 Tanguy

 2012/8/20 Karthick Duraisamy Soundararaj d.s.karth...@gmail.com

 Hello Mikhail,
 Thank you for the reply. In terms of user
 experience, I want to spread out the products from same brand farther from
 each other, *atleast* in the first 50-100 results we display. I am
 thinking about two different approaches as solution.

   1. For first few results, display one top scoring
 product of a manufacturer  (For a given field, display the top scoring
 results of the unique field values for the first N matches) . This N could
 be either a percentage relative to total matches or a configurable absolute
 value.
   2. Enforce a penalty on  the score for the results
 that have duplicate field values. The penalty can be enforced in such a way
 that the results with higher scores will not be affected, as against the
 ones with lower scores.

 Both of the solutions can be implemented while sorting the documents
 with TopFieldCollector / TopScoreDocCollector.

 Does this answer your question?  Please let me know if you have any more
 questions.

 Thanks,
 Karthick

 On Mon, Aug 20, 2012 at 3:26 AM, Mikhail Khludnev 
 mkhlud...@griddynamics.com wrote:

 Hello,

 I've got the problem description below. Can you explain the expected
 user experience, and/or solution approach before diving into the algorithm
 design?

 Thanks


 On Sat, Aug 18, 2012 at 2:50 AM, Karthick Duraisamy Soundararaj 
 karthick.soundara...@gmail.com wrote:

 My problem is that when there are a lot of documents representing
 products,
 products from same manufacturer seem to appear in close proximity in
 the
 results and therefore, it doesnt provide brand diversity. When you
 search
 for sofas, you get sofas from a manufacturer A dominating the first
 page
 while the sofas from manufacturer B dominating the second page, etc.
 The
 issue here is that a manufacturer tends to describes the different
 sofas he
 produces the same way and therefore there is a very little difference
 between the documents representing two sofas.




 --
 Sincerely yours
 Mikhail Khludnev
 Tech Lead

Re: Scalability of Solr Result Grouping/Field Collapsing: Millions/Billions of documents?

2012-08-23 Thread Mikhail Khludnev
Tom,
Feel free to find my benchmark results for two alternative joining
approaches.
http://blog.griddynamics.com/2012/08/block-join-query-performs.html

Regards

On Thu, Aug 23, 2012 at 4:40 PM, Erick Erickson erickerick...@gmail.com wrote:

 Tom:

 I think my comments were that grouping on a field where there was
 a unique value _per document_ chewed up a lot of resources.
 Conceptually, there's a bucket for each unique group value, and
 grouping on a file path is just asking for trouble.

 But the memory used for grouping should max out as a function of
 the unique values in the grouped field.

 Best
 Erick

 On Wed, Aug 22, 2012 at 11:32 PM, Lance Norskog goks...@gmail.com wrote:
  Yes, distributed grouping works, but grouping takes a lot of
  resources. If you can avoid it in distributed mode, so much the better.
 
  On Wed, Aug 22, 2012 at 3:35 PM, Tom Burton-West tburt...@umich.edu
 wrote:
  Thanks Tirthankar,
 
  So the issue is memory use for sorting.  I'm not sure I understand how
  sorting of grouping fields is involved with the defaults and field
  collapsing, since the default sorts by relevance, not grouping field.  On
  the other hand I don't know much about how field collapsing is
 implemented.
 
  So far the few tests I've made haven't revealed any memory problems.  We
  are using very small string fields for grouping and I think that we
  probably only have a couple of cases where we are grouping more than a
 few
  thousand docs.   I will try to find a query with a lot of docs per group
  and take a look at the memory use using JConsole.
 
  Tom
 
 
  On Wed, Aug 22, 2012 at 4:02 PM, Tirthankar Chatterjee 
  tchatter...@commvault.com wrote:
 
   Hi Tom,
 
  We had an issue where we are keeping millions of docs in a single node
 and
  we were trying to group them on a string field which is nothing but
 full
  file path… that caused SOLR to go out of memory…
 
 
  Erick has explained nicely in the thread as to why it won’t work and I
 had
  to find another way of architecting it. 
 
 
  How do you think this is different in your case. If you want to group
 by a
  string field with thousands of similar entries I am guessing you will
 face
  the same issue. 
 
 
  Thanks,
 
  Tirthankar
  ***Legal Disclaimer***
  This communication may contain confidential and privileged material
 for
  the
  sole use of the intended recipient. Any unauthorized review, use or
  distribution
  by others is strictly prohibited. If you have received the message in
  error,
  please advise the sender by reply email and delete the message. Thank
 you.
  **
 
 
 
 
  --
  Lance Norskog
  goks...@gmail.com




-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


RE: Solr Not releasing memory

2012-09-03 Thread Mikhail Khludnev
Rohit,
Which collector do you use? Releasing physical RAM is possible with
compacting collectors like serial, parallel and maybe G1, and not possible
with CMS. More importantly, releasing is a really suspicious and even odd
requirement. Please provide more details about your JVM and overall
challenge.
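
For example, the collector is chosen with standard HotSpot flags (nothing
Solr-specific):

  -XX:+UseSerialGC         (serial, compacting)
  -XX:+UseParallelGC       (parallel, compacting)
  -XX:+UseConcMarkSweepGC  (CMS, non-compacting)
  -XX:+UseG1GC             (G1)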
On 03.09.2012 15:03, Rohit ro...@simplify360.com wrote:

 I am currently using StandardDirectoryFactory, would switching directory
factory have any impact on the indexes?

 Regards,
 Rohit


 -Original Message-
 From: Claudio Ranieri [mailto:claudio.rani...@estadao.com]
 Sent: 03 September 2012 10:03
 To: solr-user@lucene.apache.org
 Subject: RE: Solr Not releasing memory

 Are you using MMapDirectoryFactory?
 I had a swap problem on Linux with a big index when I used
MMapDirectoryFactory.
 You can try to use solr.NIOFSDirectoryFactory.


 -Original Message-
 From: Lance Norskog [mailto:goks...@gmail.com] Sent: Sunday, 2
September 2012 22:00
 To: solr-user@lucene.apache.org
 Subject: Re: Solr Not releasing memory

 1) I believe Java 1.7 releases memory back to the OS.
 2) All of the Javas I've used on Windows do this.

 Is the physical memory use a problem? Does it push out all other programs?

 Or is it just that the Java process appears larger? This explains the
latter:
 http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

 - Original Message -
 | From: Rohit ro...@simplify360.com
 | To: solr-user@lucene.apache.org
 | Sent: Sunday, September 2, 2012 1:22:14 AM
 | Subject: Solr Not releasing memory
 |
 | Hi,
 |
 |
 |
 | We are running solr3.5 using Tomcat 6.26 on a Windows Enterprise RC2
 | server; our index size is pretty large.
 |
 |
 |
 | We have noticed that once Tomcat starts using/reserving RAM it never
 | releases it, even when there is not a single user on the system.  I
 | have tried forced garbage collection, but that doesn't seem to help
 | either.
 |
 |
 |
 | Regards,
 |
 | Rohit
 |
 |
 |
 |




Re: Solr New Version causes NIO Closed Channel Exception

2012-09-03 Thread Mikhail Khludnev
Hi
Does the mmap directory work for you?
On 03.09.2012 19:20, Pavitar Singh psi...@sprinklr.com wrote:

 Hi,

 We are facing this problem repeatedly and it goes away on restarts.


 [#|2012-09-01T12:07:06.947+|SEVERE|glassfish3.1|org.apache.solr.core.SolrCore|_ThreadID=712;_ThreadName=Thread-2;|java.nio.channels.ClosedChannelException
 at sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:88)
 at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:613)
 at

 org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.readInternal(NIOFSDirectory.java:161)
 at

 org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:160)
 at

 org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:39)
 at org.apache.lucene.store.DataInput.readVInt(DataInput.java:86)
 at

 org.apache.lucene.index.codecs.standard.StandardPostingsReader$SegmentDocsEnum.read(StandardPostingsReader.java:300)
 at org.apache.lucene.search.TermScorer.refillBuffer(TermScorer.java:74)
 at org.apache.lucene.search.TermScorer.nextDoc(TermScorer.java:121)
 at org.apache.lucene.search.TermScorer.score(TermScorer.java:70)
 at
 org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:210)
 at org.apache.lucene.search.Searcher.search(Searcher.java:101)
 at

 org.apache.solr.search.SolrIndexSearcher.getDocListAndSetNC(SolrIndexSearcher.java:1289)
 at

 org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1099)
 at
 org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:358)
 at

 org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:423)
 at

 org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:231)
 at

 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:1359)
 at

 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:337)
 at

 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:240)
 at

 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:256)
 at

 org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:215)
 at

 org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:279)
 at

 org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
 at

 org.apache.catalina.core.StandardPipeline.doInvoke(StandardPipeline.java:655)
 at
 org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:595)
 at com.sun.enterprise.web.WebPipeline.invoke(WebPipeline.java:98)
 at

 com.sun.enterprise.web.PESessionLockingStandardPipeline.invoke(PESessionLockingStandardPipeline.java:91)
 at

 org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:162)
 at

 org.apache.catalina.core.StandardPipeline.doInvoke(StandardPipeline.java:655)
 at
 org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:595)
 at

 org.apache.catalina.connector.CoyoteAdapter.doService(CoyoteAdapter.java:323)
 at
 org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:227)
 at

 com.sun.enterprise.v3.services.impl.ContainerMapper.service(ContainerMapper.java:170)
 at
 com.sun.grizzly.http.ProcessorTask.invokeAdapter(ProcessorTask.java:822)
 at com.sun.grizzly.http.ProcessorTask.doProcess(ProcessorTask.java:719)
 at com.sun.grizzly.http.ProcessorTask.process(ProcessorTask.java:1013)
 at

 com.sun.grizzly.http.DefaultProtocolFilter.execute(DefaultProtocolFilter.java:225)
 at

 com.sun.grizzly.DefaultProtocolChain.executeProtocolFilter(DefaultProtocolChain.java:137)
 at
 com.sun.grizzly.DefaultProtocolChain.execute(DefaultProtocolChain.java:104)
 at
 com.sun.grizzly.DefaultProtocolChain.execute(DefaultProtocolChain.java:90)
 at
 com.sun.grizzly.http.HttpProtocolChain.execute(HttpProtocolChain.java:79)
 at

 com.sun.grizzly.ProtocolChainContextTask.doCall(ProtocolChainContextTask.java:54)
 at

 com.sun.grizzly.SelectionKeyContextTask.call(SelectionKeyContextTask.java:59)
 at com.sun.grizzly.ContextTask.run(ContextTask.java:71)
 at

 com.sun.grizzly.util.AbstractThreadPool$Worker.doWork(AbstractThreadPool.java:532)
 at

 com.sun.grizzly.util.AbstractThreadPool$Worker.run(AbstractThreadPool.java:513)
 at java.lang.Thread.run(Thread.java:619)
 |#]



RE: Solr Not releasing memory

2012-09-03 Thread Mikhail Khludnev
Rohit,

Why do you think it should free it during idle time? Let us know what
numbers you are actually watching. Check this, it can be interesting:
blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
On 04.09.2012 0:45, Markus Jelsma markus.jel...@openindex.io wrote:

 You've got more than 45GB of physical RAM in your machine? I assume it's
 actually virtual memory you're seeing, which is not a problem, even on
 Windows. It's not uncommon for resident memory to be higher than the
 allocated heap space and it's normal to have a high virtual memory address
 space if you have a large index.

 -Original message-
  From:Rohit ro...@simplify360.com
  Sent: Tue 04-Sep-2012 00:33
  To: solr-user@lucene.apache.org
  Subject: RE: Solr Not releasing memory
 
   I am talking about physical memory here; we start at -Xms of 2GB but very
  soon it goes as high as 45GB. The memory never comes down, even when a
  single user is not using the system.
 
  Regards,
  Rohit
 
 
  -Original Message-
  From: Markus Jelsma [mailto:markus.jel...@openindex.io]
  Sent: 03 September 2012 14:58
  To: solr-user@lucene.apache.org
  Subject: RE: Solr Not releasing memory
 
  It would be helpful to know which memory isn't being released. Is it
 virtual or physical or shared memory? Is it the heap space?
 
 
  -Original message-
   From:Mikhail Khludnev mkhlud...@griddynamics.com
   Sent: Mon 03-Sep-2012 16:52
   To: solr-user@lucene.apache.org
   Subject: RE: Solr Not releasing memory
  
   Rohit,
   Which collector do you use? Releasing physical RAM is possible with
   compacting collectors like serial, parallel and maybe G1, and not
   possible with CMS. More importantly, such a releasing requirement is
   really suspicious and even odd. Please provide more details about
   your JVM and the overall challenge.
   On 03.09.2012 15:03, Rohit ro...@simplify360.com wrote:
   
I am currently using StandardDirectoryFactory, would switching
directory
   factory have any impact on the indexes?
   
Regards,
Rohit
   
   
-Original Message-
From: Claudio Ranieri [mailto:claudio.rani...@estadao.com]
Sent: 03 September 2012 10:03
To: solr-user@lucene.apache.org
Subject: RES: Solr Not releasing memory
   
Are you using MMapDirectoryFactory?
 I had a swap problem in Linux with a big index when I used
    MMapDirectoryFactory.
 You can try to use solr.NIOFSDirectoryFactory.
   
   
 -Original Message-
 From: Lance Norskog [mailto:goks...@gmail.com] Sent: Sunday, September 2, 2012 22:00
 To: solr-user@lucene.apache.org
 Subject: Re: Solr Not releasing memory
   
 1) I believe Java 1.7 releases memory back to the OS.
2) All of the Javas I've used on Windows do this.
   
Is the physical memory use a problem? Does it push out all other
 programs?
   
Or is it just that the Java process appears larger? This explains
the
   latter:
http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.h
tml
   
- Original Message -
| From: Rohit ro...@simplify360.com
| To: solr-user@lucene.apache.org
| Sent: Sunday, September 2, 2012 1:22:14 AM
| Subject: Solr Not releasing memory
|
| Hi,
|
|
|
 | We are running Solr 3.5 using Tomcat 6.26 on a Windows Enterprise
 | RC2 server, and our index size is pretty large.
|
|
|
| We have noticed that once tomcat starts using/reserving ram it
| never releases them, even when there is not a single user on the
| system.  I have tried forced garbage collection, but that doesn't
| seem to help either.
|
|
|
| Regards,
|
| Rohit
|
|
|
|
   
   
  
 
 
 



Re: Re: Get parent when the child is a search hit

2012-09-10 Thread Mikhail Khludnev
Hello,
One more approach is BlockJoin; see SOLR-3076:
blog.griddynamics.com/2012/08/block-join-query-performs.html
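
A rough sketch of the Lucene-level API it builds on, assuming the files are
indexed in the same block right before their folder parent via
IndexWriter.addDocuments(), and using the type field idea suggested in the
quote below (all field names are illustrative):

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.*;
  import org.apache.lucene.search.join.ScoreMode;
  import org.apache.lucene.search.join.ToParentBlockJoinQuery;

  // filter identifying the parent (folder) doc of each block
  Filter parents = new CachingWrapperFilter(
      new QueryWrapperFilter(new TermQuery(new Term("type", "folder"))));
  // match the children (files) and return their enclosing folders
  Query files = new TermQuery(new Term("name", "report.txt"));
  Query q = new ToParentBlockJoinQuery(files, parents, ScoreMode.None);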
 On 11.09.2012 5:40, 李�S liyun2...@corp.netease.com wrote:

 I think denormalizing the data is the best way.

 2012-09-11



 李�S



 From: jimtronic
 Sent: 2012-09-11 01:38
 Subject: Re: Get parent when the child is a search hit
 To: solr-user solr-user@lucene.apache.org
 Cc:

 You could create a type field with folder or file as values and then
 have the parentid present in the folder docs.



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Get-parent-when-the-child-is-a-search-hit-tp4006623p4006687.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: Can solr return matched fields?

2012-09-13 Thread Mikhail Khludnev
Dan,

if you have a foo bar search phrase against the fields NAME and BRAND, and you
have 10K docs matched with the first 100 displayed, what do you actually
want to see as the fields the query matched, and for which docs?

Looking forward to additional details.

On Thu, Sep 13, 2012 at 2:40 AM, Jack Krupansky j...@basetechnology.com wrote:

 But presumably matched fields relates to indexed fields, which might not
 have stored values.

 -- Jack Krupansky

 -Original Message- From: Casey Callendrello
 Sent: Wednesday, September 12, 2012 6:15 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Can solr return matched fields?




-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: MultiSearchHandler - Boosting results of a Query

2012-09-13 Thread Mikhail Khludnev
1. Please explain how exactly you want to boost the value of a field in the
2nd query based on the results of the 1st. Please provide sample queries,
docs and results.
2. Introducing such a chaining concern, aka *ResponseAware*, seems quite
doubtful to me.
3. are you sure you are aware of tricks like
http://wiki.apache.org/solr/SolrRelevancyFAQ#How_do_I_give_a_negative_.28or_very_low.29_boost_to_documents_that_match_a_query.3F
 and http://wiki.apache.org/solr/DisMaxQParserPlugin#bq_.28Boost_Query.29
 and http://wiki.apache.org/solr/QueryElevationComponent
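
For instance, the bq trick from the DisMax link is just an extra request
parameter (values here are illustrative):

  q=ipod&defType=dismax&qf=name&bq=category:electronics^5.0

It boosts documents matching the boost query without chaining a second search.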

On Thu, Sep 13, 2012 at 11:09 PM, Karthick Duraisamy Soundararaj 
karthick.soundara...@gmail.com wrote:

 Clarification:
  Once the parser is response-aware, it's easy for the components to grab
 the response and use it. In the context of function queries, by
 components, I mean the various Functions that have been extended from
 ValueSource.

 On Thu, Sep 13, 2012 at 3:02 PM, Karthick Duraisamy Soundararaj 
 karthick.soundara...@gmail.com wrote:

 Hello all,
 I am making multiple queries in a single url and trying
 to boost the value of a field in the 2nd based on the results of the
 1st. To achieve this, my function query should be able to have access to
 the response of the first query. However, QParser and QParserPlugin only
 accepts req parameter and does not have any idea about the response.

 In a nutshell, all I am trying to do is that, during a serial execution
 of  chain of queries represented by a single url(
 https://issues.apache.org/jira/browse/SOLR-1093), I am trying to
 influence the results of  the second query with the results of the first
 query. To make the function queries ResponseAware, there are two options:

 *Option 1: Make all the QueryParsers ResponseAware*
 For this the following changes seem to be inevitable
 1. Change/overload the createParser definition of QParserPlugin
 to include SolrQueryResponse
   createParser(string qstr, SolrParams
 localParams, SolrParams params, SolrQueryResponse rsp)
 - createParser(string qstr, SolrParams localParams, SolrParams params,
 SolrQueryRequest req, SolrQueryResponse rsp)
 2. Make similar changes to the getParser function in QPareser


 *Option 2: Make FunctionQueryParser alone ResponseAware*
 For this, following changes need to be made
1. Overload the FunctionQueryParserPlugin's create method with the
 following signature
createParser(string qstr, SolrParams localParams,
 SolrParams params, SolrQueryRequest req, SolrQueryResponse rsp)
2. Overload the getParser methong in QParser to permit the extra
 SolrResponse parameter and invoke this call wherever necessary.


 Once the parser is response aware, its easy for the components to grab
 the response and use them.

 This change to interface would mandate changes across the various
 components of SOLR that use all the different kind of parsers but I think
 this would be a useful feature as it has been requested by different people
 at various times. I would appreciate any kind of
 suggestions/feedback. Also, I would be more than happy to discuss if there
 are anyother way of doing the same.









-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: MMapDirectory

2012-09-20 Thread Mikhail Khludnev
My limited understanding, confirmed by a profiler though, is that mmap
IO costs you copying bytes from mmapped virtual memory into the heap. Just
look into java.nio.DirectByteBuffer.get(byte[], int, int). It has happened to
me several times: we saw a hotspot in the profiler on mmapped IO (yep, just in
copying bytes!!), cached the data in the heap, and the hotspot moved after that.
A good example of a heap cache for mmapped data is the TermInfos cache with its
configurable interval.
Overall, the question is absolutely worth thinking about.

On Thu, Sep 20, 2012 at 9:39 PM, Erick Erickson erickerick...@gmail.com wrote:

 So I just had a curiosity question pop up and wanted to check it out.
 Solr has the documentCache, designed to hold stored fields while
 various parts of a requestHandler do their tricks, keeping the stored
 content from having to be re-fetched from disk. When using
 MMapDirectory, is this even something to worry about?

 It seems like documentCache wouldn't be all that useful, but then I
 don't have a deep understanding here. I can imagine scenarios where it
 would be more efficient i.e. it's targeted to the documents actually
 being accessed rather than random places on disk in the fdt/fdx
 files

 Thanks,
 Erick




-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: DIH problem

2012-09-21 Thread Mikhail Khludnev
Gian,

The only way to handle it is to provide a test case and attach it to a JIRA issue.

Thanks

On Fri, Sep 21, 2012 at 6:03 PM, Gian Marco Tagliani
gm.tagli...@gmail.com wrote:

 Hi,
 I'm updating my Solr from version 3.4 to version 3.6.1 and I'm facing a
 little problem with the DIH.

 In the delta-import I'm using the /parentDeltaQuery/ feature of the DIH to
 update the parent entity.
 I don't think this is working properly.

 I realized that it's just executing the /parentDeltaQuery/ with the first
 record of the /deltaQuery /result.
 Comparing the code with the previous versions I noticed that the
 rowIterator was never set to null.

 To solve this I wrote a simple patch:

 -
 Index: solr/contrib/dataimporthandler/src/java/org/apache/solr/handler/dataimport/EntityProcessorBase.java
 ===================================================================
 --- solr/contrib/dataimporthandler/src/java/org/apache/solr/handler/dataimport/EntityProcessorBase.java (revision 31454)
 +++ solr/contrib/dataimporthandler/src/java/org/apache/solr/handler/dataimport/EntityProcessorBase.java (working copy)
 @@ -121,6 +121,7 @@
          if (rowIterator.hasNext())
            return rowIterator.next();
          query = null;
 +        rowIterator = null;
          return null;
        } catch (Exception e) {
          SolrException.log(log, "getNext() failed for query '" + query + "'", e);
 -


 Do you think this is correct?

 Thanks for your help

 --
 Gian Marco Tagliani






-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Return only matched multiValued field

2012-09-24 Thread Mikhail Khludnev
Hi,
It seems like a job for the highlighting feature; a minimal sketch is below.
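
A minimal sketch with the standard highlighting parameters (field name taken
from your example):

  q=comment:gold&fl=id&hl=true&hl.fl=comment&hl.snippets=10&hl.fragsize=0

hl.fragsize=0 makes each returned fragment span the whole field value, and
fl=id keeps the full multi-valued field itself off the wire.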
On 24.09.2012 0:51, Dotan Cohen dotanco...@gmail.com wrote:

 Assuming a multivalued, stored and indexed field with name comment.
 When performing a search, I would like to return only the values of
 comment which contain the match. For example:

 When searching for gold instead of getting this result:

 doc
 arr name=comment
 strTheres a lady whos sure/str
 strall that glitters is gold/str
 strand shes buying a stairway to heaven/str
 /arr
 /doc

 I would prefer to get this result:

 doc
 arr name=comment
 strall that glitters is gold/str
 /arr
 /doc

 (psuedo-XML from memory, may not be accurate but illustrates the point)

 Is there any way to do this with a Solr 4 index? The client accessing
 Solr is on a dial-up connection (no provision for DSL or other high
 speed internet) so I'd like to move as little data over the wire as
 possible. In reality, the array will have tens of fields so returning
 only the relevant fields may reduce the data transferred by an order
 of magnitude.

 Thanks.

 --
 Dotan Cohen

 http://gibberish.co.il
 http://what-is-what.com



Re: Getting the distribution information of scores from query

2012-09-26 Thread Mikhail Khludnev
I suggest creating a component and putting it after QueryComponent. In
prepare() it should add its own PostFilter into the list of request filters;
that post filter will be able to inject its own DelegatingCollector, and then
you can just add the collected histogram into the result named list (rough
collecting sketch below):
 http://searchhub.org/dev/2012/02/10/advanced-filter-caching-in-solr/
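
A rough sketch of the collecting part in plain Lucene Collector terms (wiring
it into a PostFilter/DelegatingCollector is left out; the class and accessor
names of the sketch itself are made up):

  import java.io.IOException;
  import org.apache.lucene.index.AtomicReaderContext;
  import org.apache.lucene.search.Collector;
  import org.apache.lucene.search.Scorer;

  // accumulates mean and std-dev of hit scores via Welford's online algorithm
  public class ScoreStatsCollector extends Collector {
    private Scorer scorer;
    private long n;
    private double mean, m2;

    @Override public void setScorer(Scorer scorer) { this.scorer = scorer; }
    @Override public void setNextReader(AtomicReaderContext context) {}
    @Override public boolean acceptsDocsOutOfOrder() { return true; }

    @Override public void collect(int doc) throws IOException {
      double score = scorer.score();
      n++;
      double delta = score - mean;
      mean += delta / n;             // running mean
      m2 += delta * (score - mean);  // running sum of squared deviations
    }

    public double mean() { return mean; }
    public double stdDev() { return n > 1 ? Math.sqrt(m2 / (n - 1)) : 0d; }
  }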

On Tue, Sep 25, 2012 at 10:03 PM, Amit Nithian anith...@gmail.com wrote:

 We have a federated search product that issues multiple parallel
 queries to solr cores and fetches the results and blends them. The
 approach we were investigating was taking the scores, normalizing them
 based on some distribution (normal distribution seems reasonable) and
 use that z score as the way to blend the results (else you'll be
 blending scores on different scales). To accomplish this, I was
 looking to get the distribution of the scores for the query as an
 analog to the stats component but seem to see the only way to
 accomplish this would be to create a custom collector that would
 accumulate and store this information (mean, std-dev etc) since the
 stats component only operates on indexed fields.

 Is there an easy way to tell Solr to use a custom collector without
 having to modify the SolrIndexSearcher class? Maybe is there an
 alternative way to get this information?

 Thanks
 Amit




-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: need best solution for indexing and searching multiple, related database tables

2012-09-29 Thread Mikhail Khludnev
FYI, block join query doesn't require denormalization and is performant, but
has its own limitations, of course. Many-to-many is the most painful point. I
deal with it, but am quite far from contributing a generally applicable
approach.
 On 29.09.2012 5:21, Biff Baxter tom.bren...@acmedata.net wrote:

 Hi Walter,

 I have bought into the denormalize approach.  My remaining questions are
 around how to construct the denormalized view, and any Solr functions that
 would support issues related to a) minimizing the denormalization explosion
 for 3 or more tables and b) handling many-to-many relationships.

 One issue I am concerned with is, if I search for IBM and Steve Jones in my
 example
 above, no records should be returned.  How do I manage that with the
 equivalent of the
 one denormalized record approach?

 I appreciate your help.

 Biff



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/need-best-solution-for-indexing-and-searching-multiple-related-database-tables-tp4009857p4011010.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Lifecycle of a TokenFilter from TokenFilterFactory

2012-10-01 Thread Mikhail Khludnev
It's not clear what you want to achieve. I don't always create custom
TokenStreams, but if I do, I use Lucene's as a prototype to start from.

On Mon, Oct 1, 2012 at 6:07 PM, Em mailformailingli...@yahoo.de wrote:

 Hi Mikhail,

 thanks for your feedback.

 If so, how can I write UnitTests which respect the Reuse strategy?
 What's the recommended way when creating custom Tokenizers and
 TokenFilters?

 Kind regards,
 Em

 Am 01.10.2012 10:54, schrieb Mikhail Khludnev:
  Hello,
 
  Analyzers are reused. Analyzer is Tokenizer and several TokenFilters.
 Check
  the source org.apache.lucene.analysis.Analyzer, pay attention to
  reuseStrategy.
 
  Best regards
 
  On Sun, Sep 30, 2012 at 5:37 PM, Em mailformailingli...@yahoo.de
 wrote:
 
  Hello list,
 
  I saw a bug in a TokenFilter that only works, if there is a fresh
  instance created by the TokenFilterFactory and it seems as TokenFilters
  are reused some how for more than one request.
 
  So, if your TokenFilterFactory has a Logging-Statement in its
  create()-method, you see that log only now and again - but not on every
  request.
 
  Is this a bug in Solr 4.0-BETA or is this expected behaviour?
  If it is expected, what could be wrong with the TokenFilter?
 
  Kind regards,
  Em
 
 
 
 




-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Lifecycle of a TokenFilter from TokenFilterFactory

2012-10-01 Thread Mikhail Khludnev
Ok. I might get what you are looking for. Extend SolrTestCaseJ4 (see
plenty of samples in the codebase). Obtain a request via req(), obtain the
schema from it by getSchema(), then getAnalyzer() or getQueryAnalyzer(), and
ask for analysis via org.apache.lucene.analysis.Analyzer.tokenStream(String, Reader).
You'll find your filters cached in the IndexSchema analyzers.

Let me know if it helps.
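
Something like this sketch (the field name and inputs are made up; the second
consume() call is where a stale filter betrays itself):

  import java.io.StringReader;
  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
  import org.apache.solr.SolrTestCaseJ4;
  import org.junit.BeforeClass;
  import org.junit.Test;

  public class ReusedTokenFilterTest extends SolrTestCaseJ4 {
    @BeforeClass
    public static void beforeClass() throws Exception {
      initCore("solrconfig.xml", "schema.xml");
    }

    // consume the stream fully, the way indexing would
    private void consume(Analyzer a, String text) throws Exception {
      TokenStream ts = a.tokenStream("text_field", new StringReader(text));
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      ts.reset();
      while (ts.incrementToken()) {
        System.out.println(term.toString());
      }
      ts.end();
      ts.close();
    }

    @Test
    public void testFilterIsReused() throws Exception {
      Analyzer a = req().getSchema().getAnalyzer();
      consume(a, "first input");  // first call creates the filter chain
      consume(a, "second input"); // second call reuses it; stale state shows up here
    }
  }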

On Mon, Oct 1, 2012 at 10:54 PM, Em mailformailingli...@yahoo.de wrote:

 That's exactly the way I do it when I have to write some custom stuff.

 My problem is that I do not know how to integrate an Analyzer's
 reusability-feature into a Unit-Test to see what happens if - i.e. - a
 TokenFilter-instance is going to be reused.

 Some TokenFilter-prototypes I've seen are stateful and do not reset
  their state as necessary in order to be reused. This problem only
 occurs when I deploy those Filters to Solr and index or search for some
 documents (which does not always calls create() on the
 TokenFilterFactory).  However I have to be able - at least somehow - to
 tackle those problems in Unit-Tests instead of noticing such problems
 after a deployment to Solr.

 So my question is:
 How can I (Unit-)test a TokenFilter with an Analyzer which reuses the
 same TokenFilter instance for more than one Input-TokenStream?

 Kind regards,
 Em

 Am 01.10.2012 19:43, schrieb Mikhail Khludnev:
  It's not clear what you want to achieve. I don't always create custom
  TokenStreams, but if I do I use Lucenes as a prototype to start from.
 
  On Mon, Oct 1, 2012 at 6:07 PM, Em mailformailingli...@yahoo.de wrote:
 
  Hi Mikhail,
 
  thanks for your feedback.
 
  If so, how can I write UnitTests which respect the Reuse strategy?
  What's the recommended way when creating custom Tokenizers and
  TokenFilters?
 
  Kind regards,
  Em
 
  Am 01.10.2012 10:54, schrieb Mikhail Khludnev:
  Hello,
 
  Analyzers are reused. Analyzer is Tokenizer and several TokenFilters.
  Check
  the source org.apache.lucene.analysis.Analyzer, pay attention to
  reuseStrategy.
 
  Best regards
 
  On Sun, Sep 30, 2012 at 5:37 PM, Em mailformailingli...@yahoo.de
  wrote:
 
  Hello list,
 
  I saw a bug in a TokenFilter that only works, if there is a fresh
  instance created by the TokenFilterFactory and it seems as
 TokenFilters
  are reused some how for more than one request.
 
  So, if your TokenFilterFactory has a Logging-Statement in its
  create()-method, you see that log only now and again - but not on
 every
  request.
 
  Is this a bug in Solr 4.0-BETA or is this expected behaviour?
  If it is expected, what could be wrong with the TokenFilter?
 
  Kind regards,
  Em
 
 
 
 
 
 
 
 




-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Synonyms Phrase not working

2012-10-02 Thread Mikhail Khludnev
Gustav,

AFAIK, multi-word synonyms are one of the weak points of Lucene/Solr. I'm
going to propose a solution approach at the forthcoming Eurocon:
http://www.apachecon.eu/schedule/presentation/18/ . You are welcome!



-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Can I rely on correct handling of interrupted status of threads?

2012-10-02 Thread Mikhail Khludnev
I remember a bug in EmbeddedSolrServer at 1.4.1 where an exception bypassed
request closing, which led to a searcher leak and OOM. It was fixed about two
years ago.

On Tue, Oct 2, 2012 at 1:48 PM, Robert Krüger krue...@lesspain.de wrote:

 Hi,

 I'm using Solr 3.6.1 in an application embedded directly, i.e. via
 EmbeddedSolrServer, not over an HTTP connection, which works
 perfectly. Our application uses Thread.interrupt() for canceling
 long-running tasks (e.g. through Future.cancel). A while (and a few
 Solr versions) back a colleague of mine implemented a workaround
 because he said that Solr didn't handle the thread's interrupted
 status correctly, i.e. not setting the interrupted status after having
 caught an InterruptedException or rethrowing it, thus killing the
 information that an interrupt has been requested, which breaks
 libraries relying on that. However, I did not find anything up-to-date
 in mailing list or forum archives on the web. Is that still or was it
 ever the case? What does one have to watch out for when interrupting a
 thread that is doing anything within Solr/Lucene?

 Any advice would be appreciated.

 Regards,

 Robert




-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Follow links in xml doc

2012-10-03 Thread Mikhail Khludnev
Billy,

Have you tried
http://wiki.apache.org/solr/DataImportHandler#ScriptTransformer ?
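
Something along these lines (a sketch only; the feed layout, URLs and field
names are all made up):

  <dataConfig>
    <dataSource type="URLDataSource"/>
    <script><![CDATA[
        function extractLink(row) {
            // turn the link column of the outer doc into a fetchable URL
            row.put('child_url', row.get('link'));
            return row;
        }
    ]]></script>
    <document>
      <entity name="parent" processor="XPathEntityProcessor"
              url="http://host/feed.xml" forEach="/docs/doc"
              transformer="script:extractLink">
        <field column="id" xpath="/docs/doc/id"/>
        <field column="link" xpath="/docs/doc/link"/>
        <!-- nested entity follows the link; its fields land in the same Solr doc -->
        <entity name="child" processor="XPathEntityProcessor"
                url="${parent.child_url}" forEach="/detail">
          <field column="body" xpath="/detail/body"/>
        </entity>
      </entity>
    </document>
  </dataConfig>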

On Wed, Oct 3, 2012 at 7:11 AM, Otis Gospodnetic otis.gospodne...@gmail.com
 wrote:

 Hi Billy,

 There is nothing in Solr that will do XML parsing and link extraction,
 so you'll need to do that part.  Once you do that have a look at Solr
 join for parent-child querying.

 http://search-lucene.com/?q=solr+join

 Otis
 --
 Search Analytics - http://sematext.com/search-analytics/index.html
 Performance Monitoring - http://sematext.com/spm/index.html


 On Tue, Oct 2, 2012 at 9:51 PM, Billy Newman newman...@gmail.com wrote:
  Hello again all.
 
  I have a URLDataSource to index xml data.  Is there any way to follow
  links within the xml doc and index items in those under the same
  document?  I.E. if I search for a word or term and that term lives in
  a link of doc with ID 12345 I would like to return that doc when
  searched.
 
  Thanks,
  Billy




-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Multi-Select Faceting with delimited field values

2012-10-04 Thread Mikhail Khludnev
The only way to do that is to split your attributes, which are concatenations
of attr and val: you should have a color attr with vals red, green, blue;
hdmi: yes/no; speakers: yes/no. See the sketch below.
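
Once split, the standard multi-select tag/exclude syntax gives the counts you
want (a sketch with the field names above):

  q=category:monitors
  &fq={!tag=col}color:black
  &facet=true
  &facet.field={!ex=col}color
  &facet.field=hdmi
  &facet.field=speakers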
On 04.10.2012 5:19, Aaron Bains aaronba...@gmail.com wrote:

 I am trying to set up my query for multi-select faceting, here is my
 attempt at the query:


 q=category:monitors&fq=attribute:(color-black)&facet.field=attribute&facet=true

 The undesired response from the query:

 response
 result name=response numFound=1 start=0
 doc
 str name=productid1019141675/str
 /doc
 /result
 lst name=facet_counts
 lst name=facet_queries/
 lst name=facet_fields
 lst name=attribute
  int name=color-black1/int
  int name=vga-yes1/int
  int name=hdmi-yes1/int
  int name=speakers-yes1/int
 /lst
 /lst
 lst name=facet_dates/
 lst name=facet_ranges/
 /lst
 /response




 The desired response:

 response
 result name=response numFound=1 start=0
 doc
 str name=productid1019141675/str
 /doc
 /result
 lst name=facet_counts
 lst name=facet_queries/
 lst name=facet_fields
 lst name=attribute
  int name=color-black120/int
  int name=color-silver58/int
  int name=color-white13/int
  int name=vga-yes1/int
  int name=hdmi-yes1/int
  int name=speakers-yes1/int
 /lst
 /lst
 lst name=facet_dates/
 lst name=facet_ranges/
 /lst
 /response


 The way I have the attribute and value delimited by a dash has me stumped
 on how to perform the tagging and excluding. If we exclude the entire
 attribute field with facet.field={!ex=dt}attribute it brings an undesired
 result. What I need to do is exclude (attribute:color)

 Thanks for the help!!



Re: Can I rely on correct handling of interrupted status of threads?

2012-10-04 Thread Mikhail Khludnev
it was another exception class.

On Thu, Oct 4, 2012 at 5:19 PM, Robert Krüger krue...@lesspain.de wrote:

 On Tue, Oct 2, 2012 at 8:50 PM, Mikhail Khludnev
 mkhlud...@griddynamics.com wrote:
  I remember a bug in EmbeddedSolrServer at 1.4.1 when exception bypasses
  request closing that lead to searcher leak and OOM. It was fixed about
 two
  years ago.
 
 You mean InterruptedException?




-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Getting list of operators and terms for a query

2012-10-04 Thread Mikhail Khludnev
You've got the ResponseBuilder as the process() or prepare() argument; check
its query field. Note your component should be registered after QueryComponent
in your requestHandler config. A rough sketch is below.
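
A rough sketch (illustrative class and output names; it walks BooleanQuery
clauses, so you get Lucene's Occur flags (MUST/SHOULD/MUST_NOT) rather than
literal AND/OR, and other query types would need extra cases):

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.List;
  import org.apache.lucene.search.BooleanClause;
  import org.apache.lucene.search.BooleanQuery;
  import org.apache.lucene.search.Query;
  import org.apache.solr.handler.component.ResponseBuilder;
  import org.apache.solr.handler.component.SearchComponent;

  public class QueryTermsComponent extends SearchComponent {
    @Override
    public void prepare(ResponseBuilder rb) throws IOException {}

    @Override
    public void process(ResponseBuilder rb) throws IOException {
      List<String> out = new ArrayList<String>();
      flatten(rb.getQuery(), out); // already parsed by QueryComponent.prepare()
      rb.rsp.add("queryTerms", out);
    }

    // walks nested boolean queries, emitting each operator before its operands
    private void flatten(Query q, List<String> out) {
      if (q instanceof BooleanQuery) {
        for (BooleanClause c : ((BooleanQuery) q).clauses()) {
          out.add("Operator " + c.getOccur());
          flatten(c.getQuery(), out);
        }
      } else {
        out.add("Term " + q);
      }
    }

    @Override public String getDescription() { return "lists operators and terms"; }
    @Override public String getSource() { return null; }
  }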

On Thu, Oct 4, 2012 at 6:03 PM, Davide Lorenzo Marino 
davide.mar...@gmail.com wrote:

 Hi All,
 i'm working in a new searchComponent that analyze the search queries.
 I need to know if given a query string is possible to get the list of
 operators and terms (better in polish notation)?
 I mean if the default field is country and the query is the String

 england OR (name:paul AND city:rome)

 to get the List

 [ Operator OR, Term country:england, OPERATOR AND, Term name:paul, Term
 city:rome ]

 Thanks in advance

 Davide Marino




-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Problem with relating values in two multi value fields

2012-10-04 Thread Mikhail Khludnev
It's a typical nested-document problem; there are several approaches. The
out-of-the-box solution, as far as you need facets, is
http://wiki.apache.org/solr/FieldCollapsing .

On Thu, Oct 4, 2012 at 7:19 PM, Torben Honigbaum 
torben.honigb...@neuland-bfi.de wrote:

 Hi Jack,

 thank you for your answer. The problem is, that I don't know the value for
 option A and that the values are numbers and I've to use the values as
 facet. So I need something like this:

 Docs:

 doc
   str name=id3/str
   str name=options
 strA/str
 strB/str
 ...
   str
   str name=value
 str200/str
 str400/str
 ...
   str
 /doc
 doc
   str name=id4/str
   str name=options
 strA/str
 strE/str
 ...
   str
   str name=value
 str300/str
 str400/str
 ...
   str
 /doc
 doc
   str name=id6/str
   str name=options
 strA/str
 strC/str
 ...
   str
   str name=value
 str200/str
 str400/str
 ...
   str
 /doc

 Query: …?q=options:A

 Facet: 200 (2), 300 (1)

 Thank you
 Torben

 On 04.10.2012 at 17:10, Jack Krupansky wrote:

  Use a field called option_value_pairs with values like A 200 and
 then query with a quoted phrase A 200.
 
  You could use a special character like equal sign instead of space:
 A=200 and then you don't have to quote it in the query.
 
  -- Jack Krupansky
 
  -Original Message- From: Torben Honigbaum
  Sent: Thursday, October 04, 2012 11:03 AM
  To: solr-user@lucene.apache.org
  Subject: Problem with relating values in two multi value fields
 
  Hello,
 
  I've a problem with relating values in two multi value fields. My
 documents look like this:
 
  doc
  str name=id3/str
  str name=options
strA/str
strB/str
strC/str
strD/str
  str
  str name=value
str200/str
str400/str
str240/str
str310/str
  str
  /doc
 
  My problem is that I've to search for a set of documents and display
 only the value for option A, for example, and use the value field as facet
 field. I need a result like this:
 
  doc
  str name=id3/str
  str name=optionsA/str
  str name=value200/str
  /doc
  facet …
 
  I think that this is a use case which isn't possible, right? So can
 someone show me an alternative way to solve this problem? The documents
 each have 500 options with 500 related values.
 
  Thank you
  Torben
 




-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Identify exact search in edismax

2012-10-04 Thread Mikhail Khludnev
The overall task is not clear to me, but if you want all of a field's terms to
have matched the user query, I'd suggest introducing your own Similarity:
 - write the number of terms as the norm value (which is by default a byte per
doc per field), then
 - you'll be able to retrieve this number at search time and use it for
evaluating your own mm criteria (see the sketch below).
WDYT?
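
A minimal sketch of the indexing half in Lucene 4.x terms (decoding the norm
and comparing it to the matched-term count at search time is left out; note
norms are lossily encoded into a single byte per doc per field):

  import org.apache.lucene.index.FieldInvertState;
  import org.apache.lucene.search.similarities.DefaultSimilarity;

  // store the field's term count as the norm instead of the usual 1/sqrt(length)
  public class TermCountSimilarity extends DefaultSimilarity {
    @Override
    public float lengthNorm(FieldInvertState state) {
      return state.getLength();
    }
  }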

On Thu, Oct 4, 2012 at 9:28 PM, rhl4tr rhl4...@gmail.com wrote:

 I am using edismax for guessing category from user query.

 If user says I want to buy BMW and Audi car. This query will be fed to
 edismax which will give me results based on phrase match.

 Field contains following values
 -BMW = Cars category
 -Audi = Cars
 -2 BHK = Real Estate
 -need job = jobs category
 -Buy 1Bhk - Apartments

 I get results with phrase matches on top.

 Generally top result will be a phrase match (if there are any). How can I
 know that field's all terms have matched to user query.

 e.g.
 mm = percentage of user query terms should match with field terms

 I want opposite = percentage of field values should match with user query.
 which is in my case 100% = phrase match





 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Identify-exact-search-in-edismax-tp4011859.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Identify exact search in edismax

2012-10-05 Thread Mikhail Khludnev
Absolutely, that's what I didn't get in your initial question. Okay, it seems
you are talking about a typical eCommerce search problem. I will speak about
it at http://www.apachecon.eu/schedule/presentation/18/ . See you.

On Fri, Oct 5, 2012 at 9:47 AM, rhl4tr rhl4...@gmail.com wrote:

 But the user query can contain any number of terms. I cannot know how many
 field terms it has to match.

 {
   "responseHeader": {
     "status": 0,
     "QTime": 1,
     "params": {
       "mm": "0",
       "sort": "score desc",
       "indent": "true",
       "qf": "exact_keywords",
       "wt": "json",
       "rows": "1",
       "defType": "dismax",
       "pf": "exact_keywords",
       "debugQuery": "false",
       "fl": "data_id,data_name,exact_keywords",
       "start": "0",
       "q": "i want to by honda suzuki",
       "fq": "+data_type:pwords"}},
   "response": {"numFound": 2, "start": 0, "docs": [
       {
         "data_name": "Cars ",
         "data_id": "71",
         "exact_keywords": "honda suzuki",
         "term_mm": "100%"},
       {
         "data_name": "bikes ",
         "data_id": "72",
         "exact_keywords": "suzuki",
         "term_mm": "50%"}
   ]
 }}

 A hypothetical solution would look like the above JSON response.
 The term_mm parameter would tell what percentage of the field's terms matched
 the user query.




 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Identify-exact-search-in-edismax-tp4011859p4011976.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: PriorityQueue:initialize consistently showing up as hot spot while profiling

2012-10-05 Thread Mikhail Khludnev
what's the value of rows param
http://wiki.apache.org/solr/CommonQueryParameters#rows ?

On Fri, Oct 5, 2012 at 6:56 AM, Aaron Daubman daub...@gmail.com wrote:

 Greetings,

 I've been seeing this call chain come up fairly frequently when
 debugging longer-QTime queries under Solr 3.6.1 but have not been able
 to understand from the code what is really going on - the call graph
 and code follow below.

 Would somebody please explain to me:
 1) Why this would show up frequently as a hotspot
 2) If it is expected to do so
 3) If there is anything I should look in to that may help performance
 where this frequently shows up as the long pole in the QTime tent
 4) What the code is doing and why heap is being allocated as an
 apparently giant object (which also is apparently not unheard of due
 to MAX_VALUE wrapping check)

 ---call-graph---
 Filter - SolrDispatchFilter:doFilter (method time = 12 ms, total time = 487 ms)
  Filter - SolrDispatchFilter:execute:365 (method time = 0 ms, total time = 109 ms)
   org.apache.solr.core.SolrCore:execute:1376 (method time = 0 ms, total time = 109 ms)
    org.apache.solr.handler.RequestHandlerBase:handleRequest:129 (method time = 0 ms, total time = 109 ms)
     org.apache.solr.handler.component.SearchHandler:handleRequestBody:186 (method time = 0 ms, total time = 109 ms)
      com.echonest.solr.component.EchoArtistGroupingComponent:process:188 (method time = 0 ms, total time = 109 ms)
       org.apache.solr.search.SolrIndexSearcher:search:375 (method time = 0 ms, total time = 96 ms)
        org.apache.solr.search.SolrIndexSearcher:getDocListC:1176 (method time = 0 ms, total time = 96 ms)
         org.apache.solr.search.SolrIndexSearcher:getDocListNC:1209 (method time = 0 ms, total time = 96 ms)
          org.apache.solr.search.SolrIndexSearcher:getProcessedFilter:796 (method time = 0 ms, total time = 26 ms)
           org.apache.solr.search.BitDocSet:andNot:185 (method time = 0 ms, total time = 13 ms)
            org.apache.lucene.util.OpenBitSet:clone:732 (method time = 13 ms, total time = 13 ms)
           org.apache.solr.search.BitDocSet:intersection:31 (method time = 0 ms, total time = 13 ms)
            org.apache.solr.search.DocSetBase:intersection:90 (method time = 0 ms, total time = 13 ms)
             org.apache.lucene.util.OpenBitSet:and:808 (method time = 13 ms, total time = 13 ms)
          org.apache.lucene.search.TopFieldCollector:create:916 (method time = 0 ms, total time = 46 ms)
           org.apache.lucene.search.FieldValueHitQueue:create:175 (method time = 0 ms, total time = 46 ms)
            org.apache.lucene.search.FieldValueHitQueue$MultiComparatorsFieldValueHitQueue:init:111 (method time = 0 ms, total time = 46 ms)
             org.apache.lucene.search.SortField:getComparator:409 (method time = 0 ms, total time = 13 ms)
              org.apache.lucene.search.FieldComparator$FloatComparator:init:400 (method time = 13 ms, total time = 13 ms)
             org.apache.lucene.util.PriorityQueue:initialize:108 (method time = 33 ms, total time = 33 ms)
 ---snip---


 org.apache.lucene.util.PriorityQueue:initialize - hotspot is line 108:
 heap = (T[]) new Object[heapSize]; // T is unbounded type, so this
 unchecked cast works always

 ---PriorityQueue.java---
   /** Subclass constructors must call this. */
   @SuppressWarnings(unchecked)
   protected final void initialize(int maxSize) {
 size = 0;
 int heapSize;
 if (0 == maxSize)
   // We allocate 1 extra to avoid if statement in top()
   heapSize = 2;
 else {
   if (maxSize == Integer.MAX_VALUE) {
 // Don't wrap heapSize to -1, in this case, which
 // causes a confusing NegativeArraySizeException.
 // Note that very likely this will simply then hit
 // an OOME, but at least that's more indicative to
 // caller that this values is too big.  We don't +1
 // in this case, but it's very unlikely in practice
 // one will actually insert this many objects into
 // the PQ:
 heapSize = Integer.MAX_VALUE;
   } else {
 // NOTE: we add +1 because all access to heap is
 // 1-based not 0-based.  heap[0] is unused.
 heapSize = maxSize + 1;
   }
 }
 heap = (T[]) new Object[heapSize]; // T is unbounded type, so this
 unchecked cast works always
 this.maxSize = maxSize;

 // If sentinel objects are supported, populate the queue with them
 T sentinel = getSentinelObject();
 if (sentinel != null) {
   heap[1] = sentinel;
   for (int i = 2; i  heap.length; i++) {
 heap[i] = getSentinelObject();
   }
   size = maxSize;
 }
   }
 ---snip---


 Thanks, as always!
  Aaron




-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Identify exact search in edismax

2012-10-05 Thread Mikhail Khludnev
I have only pencil scratches yet, can't share it. I can say that i've found
it quite close to approach described there
http://www.ulakha.com/publications.html it's called there Concept Search,
but as far as I understand I have rather different implementation approach.

On Fri, Oct 5, 2012 at 2:31 PM, rhl4tr rhl4...@gmail.com wrote:

 Can you please get me started. I cannot wait till the presentation.



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Identify-exact-search-in-edismax-tp4011859p4012006.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: PriorityQueue:initialize consistently showing up as hot spot while profiling

2012-10-05 Thread Mikhail Khludnev
Okay. A huge rows value is the no. 1 way to kill Lucene; it's absolutely not
workable. You need to rethink the logic of your component. Check Solr's
FieldCollapsing code; IIRC it makes a second search to achieve a similar goal.
Also check the PostFilter and DelegatingCollector classes; their approach can
also be handy for your task.

On Fri, Oct 5, 2012 at 2:38 PM, Aaron Daubman daub...@gmail.com wrote:

 On Fri, Oct 5, 2012 at 4:33 AM, Mikhail Khludnev
 mkhlud...@griddynamics.com wrote:
  what's the value of rows param
  http://wiki.apache.org/solr/CommonQueryParameters#rows ?

 Very interesting question - so, for historic reasons lost to me, we
 pass in a huge (1000?) number for rows and this hits our custom
 component, which has its own internal maximum for real rows returned.
 (This is a custom grouping component, so I am guessing the large
 number of rows had to do with trying not to limit what got grouped?).

 Is the value of rows what is used for that heap allocation?

Absolutely. It's the classic priority-queue algorithm on a binary heap.



 Thanks,
  Aaron

 
  On Fri, Oct 5, 2012 at 6:56 AM, Aaron Daubman daub...@gmail.com wrote:
 
  Greetings,
 
  I've been seeing this call chain come up fairly frequently when
  debugging longer-QTime queries under Solr 3.6.1 but have not been able
  to understand from the code what is really going on - the call graph
  and code follow below.
 
  Would somebody please explain to me:
  1) Why this would show up frequently as a hotspot
  2) If it is expected to do so
  3) If there is anything I should look in to that may help performance
  where this frequently shows up as the long pole in the QTime tent
  4) What the code is doing and why heap is being allocated as an
  apparently giant object (which also is apparently not unheard of due
  to MAX_VALUE wrapping check)
 
  ---call-graph---
  Filter - SolrDispatchFilter:doFilter (method time = 12 ms, total time =
  487 ms)
   Filter - SolrDispatchFilter:execute:365 (method time = 0 ms, total
  time = 109 ms)
org.apache.solr.core.SolrCore:execute:1376 (method time = 0 ms,
  total time = 109 ms)
 org.apache.solr.handler.RequestHandlerBase:handleRequest:129
  (method time = 0 ms, total time = 109 ms)
 
 org.apache.solr.handler.component.SearchHandler:handleRequestBody:186
  (method time = 0 ms, total time = 109 ms)
   com.echonest.solr.component.EchoArtistGroupingComponent:process:188
  (method time = 0 ms, total time = 109 ms)
org.apache.solr.search.SolrIndexSearcher:search:375 (method time
  = 0 ms, total time = 96 ms)
 org.apache.solr.search.SolrIndexSearcher:getDocListC:1176
  (method time = 0 ms, total time = 96 ms)
  org.apache.solr.search.SolrIndexSearcher:getDocListNC:1209
  (method time = 0 ms, total time = 96 ms)
   org.apache.solr.search.SolrIndexSearcher:getProcessedFilter:796
  (method time = 0 ms, total time = 26 ms)
org.apache.solr.search.BitDocSet:andNot:185 (method time = 0
  ms, total time = 13 ms)
 org.apache.lucene.util.OpenBitSet:clone:732 (method time =
  13 ms, total time = 13 ms)
org.apache.solr.search.BitDocSet:intersection:31 (method
  time = 0 ms, total time = 13 ms)
 org.apache.solr.search.DocSetBase:intersection:90 (method
  time = 0 ms, total time = 13 ms)
  org.apache.lucene.util.OpenBitSet:and:808 (method time =
  13 ms, total time = 13 ms)
   org.apache.lucene.search.TopFieldCollector:create:916 (method
  time = 0 ms, total time = 46 ms)
org.apache.lucene.search.FieldValueHitQueue:create:175
  (method time = 0 ms, total time = 46 ms)
 
 
  
 org.apache.lucene.search.FieldValueHitQueue$MultiComparatorsFieldValueHitQueue:init:111
  (method time = 0 ms, total time = 46 ms)
  org.apache.lucene.search.SortField:getComparator:409
  (method time = 0 ms, total time = 13 ms)
 
   org.apache.lucene.search.FieldComparator$FloatComparator:init:400
  (method time = 13 ms, total time = 13 ms)
  org.apache.lucene.util.PriorityQueue:initialize:108
  (method time = 33 ms, total time = 33 ms)
  ---snip---
 
 
  org.apache.lucene.util.PriorityQueue:initialize - hotspot is line 108:
  heap = (T[]) new Object[heapSize]; // T is unbounded type, so this
  unchecked cast works always
 
  ---PriorityQueue.java---
/** Subclass constructors must call this. */
@SuppressWarnings(unchecked)
protected final void initialize(int maxSize) {
  size = 0;
  int heapSize;
  if (0 == maxSize)
// We allocate 1 extra to avoid if statement in top()
heapSize = 2;
  else {
if (maxSize == Integer.MAX_VALUE) {
  // Don't wrap heapSize to -1, in this case, which
  // causes a confusing NegativeArraySizeException.
  // Note that very likely this will simply then hit
  // an OOME, but at least that's more indicative to
  // caller that this values is too big.  We don't +1
  // in this case

Re: Problem with relating values in two multi value fields

2012-10-05 Thread Mikhail Khludnev
Denormalize your docs into option x value tuples, identifying them by a duplicated id:

doc
  str name=setid3/str
  str name=optionsA/str
  str name=value200/str
/doc
doc
  str name=setid3/str
  str name=optionsB/str
  str name=value400/str
/doc
doc
  str name=setid3/str
  str name=optionsC/str
  str name=value240/str
/doc
doc
  str name=setid3/str
  str name=optionsD/str
  str name=value310/str
/doc

then collapse them by the setid field (it cannot be the uniqueKey).

On Fri, Oct 5, 2012 at 6:26 PM, Torben Honigbaum 
torben.honigb...@neuland-bfi.de wrote:

 Hi Mikhail,

 I read the article and can't see how to solve my problem with
 FieldCollapsing.

 Any other suggestions?

 Torben

 On 04.10.2012 at 17:31, Mikhail Khludnev wrote:

  it's a typical nested document problem. there are several approaches. Out
  of the box solution as far you need facets is
  http://wiki.apache.org/solr/FieldCollapsing .
 
  On Thu, Oct 4, 2012 at 7:19 PM, Torben Honigbaum 
  torben.honigb...@neuland-bfi.de wrote:
 
  Hi Jack,
 
  thank you for your answer. The problem is, that I don't know the value
 for
  option A and that the values are numbers and I've to use the values as
  facet. So I need something like this:
 
  Docs:
 
  doc
   str name=id3/str
   str name=options
 strA/str
 strB/str
 ...
   str
   str name=value
 str200/str
 str400/str
 ...
   str
  /doc
  doc
   str name=id4/str
   str name=options
 strA/str
 strE/str
 ...
   str
   str name=value
 str300/str
 str400/str
 ...
   str
  /doc
  doc
   str name=id6/str
   str name=options
 strA/str
 strC/str
 ...
   str
   str name=value
 str200/str
 str400/str
 ...
   str
  /doc
 
  Query: …?q=options:A
 
  Facet: 200 (2), 300 (1)
 
  Thank you
  Torben
 
  Am 04.10.2012 um 17:10 schrieb Jack Krupansky:
 
  Use a field called option_value_pairs with values like A 200 and
  then query with a quoted phrase A 200.
 
  You could use a special character like equal sign instead of space:
  A=200 and then you don't have to quote it in the query.
 
  -- Jack Krupansky
 
  -Original Message- From: Torben Honigbaum
  Sent: Thursday, October 04, 2012 11:03 AM
  To: solr-user@lucene.apache.org
  Subject: Problem with relating values in two multi value fields
 
  Hello,
 
  I've a problem with relating values in two multi value fields. My
  documents look like this:
 
  doc
  str name=id3/str
  str name=options
   strA/str
   strB/str
   strC/str
   strD/str
  str
  str name=value
   str200/str
   str400/str
   str240/str
   str310/str
  str
  /doc
 
  My problem is that I've to search for a set of documents and display
  only the value for option A, for example, and use the value field as
 facet
  field. I need a result like this:
 
  doc
  str name=id3/str
  str name=optionsA/str
  str name=value200/str
  /doc
  facet …
 
  I think that this is a use case which isn't possible, right? So can
  someone show me an alternative way to solve this problem? The documents
  each have 500 options with 500 related values.
 
  Thank you
  Torben
 
 
 
 
 
  --
  Sincerely yours
  Mikhail Khludnev
  Tech Lead
  Grid Dynamics
 
  http://www.griddynamics.com
  mkhlud...@griddynamics.com




-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Need to update a field without re-indexing in solr 3.6

2012-10-05 Thread Mikhail Khludnev
Could you please tell me more. What field do you need to update, how it
influences the search results, how often, and why you can not afford commit?

On Fri, Oct 5, 2012 at 11:14 PM, Otis Gospodnetic 
otis.gospodne...@gmail.com wrote:

 Hi,

 This is not doable in Solr 3.*.  There are Lucene-level patches in
 JIRA, but I'm not sure if they are in Solr 4.*

 Otis
 --
 Search Analytics - http://sematext.com/search-analytics/index.html
 Performance Monitoring - http://sematext.com/spm/index.html


 On Fri, Oct 5, 2012 at 3:02 PM, Thakur, Pramila
 pramila_tha...@ontla.ola.org wrote:
  Hi Everyone,
 
  I am using Solr 3.6. I want to update a single filed value in the index
 without re-indexing. Is this possible?
  I have google and came across partial update in solr 4.0 BETA.
 
  Can I do do this with Solr 3.6?
 
  Thanks,
 
  -- Pramila Thakur
 
 




-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Problem with relating values in two multi value fields

2012-10-06 Thread Mikhail Khludnev
Torben,

Denormalization implies copying attrs which are common for a group into the
smaller docs:

doc
  str name=setid3/str
  str name=attribute_Avalue/str
  str name=attribute_Bvalue/str
  str name=optionsA/str
  str name=value200/str
/doc
doc
  str name=setid3/str
  str name=attribute_Avalue/str
  str name=attribute_Bvalue/str
  str name=optionsB/str
  str name=value400/str
/doc
doc
  str name=setid3/str
  str name=attribute_Avalue/str
  str name=attribute_Bvalue/str
  str name=optionsC/str
  str name=value240/str
/doc
doc
  str name=setid3/str
  str name=attribute_Avalue/str
  str name=attribute_Bvalue/str
  str name=optionsD/str
  str name=value310/str
/doc

and use group.facet=true
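
e.g. (a sketch reusing the field names above):

  q=options:A&group=true&group.field=setid&group.facet=true
  &facet=true&facet.field=value

With group.facet=true each setid group is counted once per facet value, which
yields the 200 (2), 300 (1) style counts from your earlier example.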

On Sat, Oct 6, 2012 at 2:24 AM, Torben Honigbaum 
torben.honigb...@neuland-bfi.de wrote:

 Hi Mikhail,

 thank you for your answer. Maybe my sample data was not so good. The
 documents always have additional data which I need to use as facets, like this:

 doc
   str name=id3/str
   str name=attribute_Avalue/str
   str name=attribute_Bvalue/str
   str name=options
 strA/str
 strB/str
 ...
   str
   str name=value
 str200/str
 str400/str
 ...
   str
 /doc

 Torben

 On 05.10.2012 at 17:20, Mikhail Khludnev wrote:

  denormalize your docs to option x value tuples, identify them by duping
 id.
 
  doc
   str name=setid3/str
   str name=optionsA/str
   str name=value200/str
  /doc
  doc
   str name=setid3/str
   str name=optionsB/str
   str name=value400/str
  /doc
   doc
    str name=setid3/str
    str name=optionsC/str
    str name=value240/str
   /doc
   doc
    str name=setid3/str
    str name=optionsD/str
    str name=value310/str
   /doc
 
  then collapse them by set setid field. (it can not be uniqkey).
 
  On Fri, Oct 5, 2012 at 6:26 PM, Torben Honigbaum 
  torben.honigb...@neuland-bfi.de wrote:
 
  Hi Mikhail,
 
  I read the article and can't see how to solve my problem with
  FieldCollapsing.
 
  Any other suggestions?
 
  Torben
 
  On 04.10.2012 at 17:31, Mikhail Khludnev wrote:
 
  it's a typical nested document problem. there are several approaches.
 Out
  of the box solution as far you need facets is
  http://wiki.apache.org/solr/FieldCollapsing .
 
  On Thu, Oct 4, 2012 at 7:19 PM, Torben Honigbaum 
  torben.honigb...@neuland-bfi.de wrote:
 
  Hi Jack,
 
  thank you for your answer. The problem is, that I don't know the value
  for
  option A and that the values are numbers and I've to use the values as
  facet. So I need something like this:
 
  Docs:
 
  doc
  str name=id3/str
  str name=options
strA/str
strB/str
...
  str
  str name=value
str200/str
str400/str
...
  str
  /doc
  doc
  str name=id4/str
  str name=options
strA/str
strE/str
...
  str
  str name=value
str300/str
str400/str
...
  str
  /doc
  doc
  str name=id6/str
  str name=options
strA/str
strC/str
...
  str
  str name=value
str200/str
str400/str
...
  str
  /doc
 
  Query: …?q=options:A
 
  Facet: 200 (2), 300 (1)
 
  Thank you
  Torben
 
  On 04.10.2012 at 17:10, Jack Krupansky wrote:
 
  Use a field called option_value_pairs with values like A 200 and
  then query with a quoted phrase A 200.
 
  You could use a special character like equal sign instead of space:
  A=200 and then you don't have to quote it in the query.
 
  -- Jack Krupansky
 
  -Original Message- From: Torben Honigbaum
  Sent: Thursday, October 04, 2012 11:03 AM
  To: solr-user@lucene.apache.org
  Subject: Problem with relating values in two multi value fields
 
  Hello,
 
  I've a problem with relating values in two multi value fields. My
  documents look like this:
 
  doc
  str name=id3/str
  str name=options
  strA/str
  strB/str
  strC/str
  strD/str
  str
  str name=value
  str200/str
  str400/str
  str240/str
  str310/str
  str
  /doc
 
  My problem is that I've to search for a set of documents and display
  only the value for option A, for example, and use the value field as
  facet
  field. I need a result like this:
 
  doc
  str name=id3/str
  str name=optionsA/str
  str name=value200/str
  /doc
  facet …
 
  I think that this is a use case which isn't possible, right? So can
  someone show me an alternative way to solve this problem? The
 documents
  each have 500 options with 500 related values.
 
  Thank you
  Torben
 
 
 
 
 
  --
  Sincerely yours
  Mikhail Khludnev
  Tech Lead
  Grid Dynamics
 
  http://www.griddynamics.com
  mkhlud...@griddynamics.com
 
 
 
 
  --
  Sincerely yours
  Mikhail Khludnev
  Tech Lead
  Grid Dynamics
 
  http://www.griddynamics.com
  mkhlud...@griddynamics.com




-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Get report of keywords searched.

2012-10-07 Thread Mikhail Khludnev
Rajani,

IIRC solrmeter can grab search phrases from a log. There is a special command
for doing it there, under Tools / Extract Queries.

Regards

On Sun, Oct 7, 2012 at 10:02 AM, Rajani Maski rajinima...@gmail.com wrote:

 Hi Davide,  Yes right. This can be done.

  Just one question (I am not sure if I had to create a new thread for
 it): I just wanted to know whether solrmeter or jmeter can help me get the
 list of searched keywords? I am a novice with solrmeter, and just know that
 it's used for stress testing. Interested to know if I can use the same tools
 for this case of getting the searched-keywords list.


 Thanks
 Rajani

 On Fri, Oct 5, 2012 at 7:23 PM, Davide Lorenzo Marino 
 davide.mar...@gmail.com wrote:

  If you think this could be a problem for your performances you can try
 two
  different solutions:
 
  1 - Make the call to update the db in a different thread
  2 - Make an asynchronous http call to a web application that update the
 db
  (in this case the web app can be resident in a different machine, so the
  ram, cpu time and disk operations don't slow your solr engine)
 
 
  2012/10/5 Rajani Maski rajinima...@gmail.com
 
   Hi,
  
Thank you for the reply Davide.
  
  Writing to db you mean to insert into db the search queries? I was
    thinking that this might affect search performance?
   Yes you are right, Getting stats for particular key word is tough. It
  would
   suffice if I can get q param and fq param values( when we search using
   standard request handler).  Any open source solr log analysis tools?
 Can
  we
   achieve this with solrmeter? Has anyone tried with this?
  
   Thank You
  
  
  
  
   On Thu, Oct 4, 2012 at 2:07 PM, Davide Lorenzo Marino 
   davide.mar...@gmail.com wrote:
  
If you need to analyze the search queries is not very difficult, just
create a search plugin and put them in a db.
If you need to search the single keywords it is more difficult and
 you
   need
before starting to take some decision. In particular take the
 following
queries and try to answer how you would like to treat them for the
keywards:
   
1) apple OR orange
2) apple AND orange
3) title:apple AND subject:orange
4) apple -orange
5) apple OR (orange AND banana)
6) title:apple OR subject:orange
   
Ciao
   
Davide Marino
   
   
   
   
   
   
   
   
2012/10/3 Rajani Maski rajinima...@gmail.com
   
 Hi All,

I am using solrJ. When there is search query hit, I am logging
 the
   url
 in a location and also it is getting logged into tomcat catalina
  logs.
  Now I wanted to implement a functionality of periodically(per
 week)
 analyzing search logs of solr and find out the keywords searched.
 Is
there
 a way to do it using any of the existing functionality of solr? If
  not,
 Anybody has tried this implementation with any open source tools?
 Suggestions welcome. . Awaiting reply


 Thank you.

   
  
 




-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Problem with relating values in two multi value fields

2012-10-08 Thread Mikhail Khludnev
Toke,
You are absolutely right; a concatenated term is a possible solution. I
found faceting quite complicated in this case, but it was a hot fix
which we delivered to production.

Torben,
This problem arises quite often. Besides the two approaches discussed
here, it is also possible to approach it with SpanQueries and TermPositions;
you can check our experience here:
http://blog.griddynamics.com/2011/06/solr-experience-search-parent-child.html
http://vimeo.com/album/2012142/video/33817062
Our current way is BlockJoin, which is really performant in the case of batched
updates: http://blog.griddynamics.com/2012/08/block-join-query-performs.html.
The bad thing is that there is no open facet component for block join. We have
code, but are not ready to share it yet.

On Mon, Oct 8, 2012 at 12:44 PM, Toke Eskildsen t...@statsbiblioteket.dk wrote:

 On Mon, 2012-10-08 at 08:42 +0200, Torben Honigbaum wrote:
  sorry, my fault. This was one of my first ideas. My problem is, that
  I've 1.000.000 documents, each with about 20 attributes. Additionally
  each document has between 200 and 500 option-value pairs. So if I
  denormalize the data, it means that I've 1.000.000 x 350 (200 + 500 /
  2) = 350.000.000 documents, each with 20 attributes.

 If you have a few hundred or less distinct primary attributes (the A, B,
 C's in your example), you could create a new field for each of them:

 /doc
   str name=id3/str
   str name=optionsA B C D/str
   str name=option_A200/str
   str name=option_B400/str
   str name=option_C240/str
   str name=option_D310/str
   ...
   ...
 /doc

 Query for options:A and facet on field option_A to get facets for
 the specific field.

 This normalization does increase the index size due to duplicated
 secondary values between the option-fields, but since our assumption is
 a relatively small amount of primary values, it should not be too much.


 Alternatively, if you have many distinct primary attributes, index the
 pairs as Jack suggests:
 /doc
   str name=id3/str
   str name=optionsA B C D/str
   str name=optionA=200/str
   str name=optionB=400/str
   str name=optionC=240/str
   str name=optionD=310/str
   ...
   ...
 /doc

 Query for options:A and facet on field option with
 field.prefix=A=. Your result will be A=200 (2), A=450 (1)... so you'll
 have to strip whatever= before display.

 This normalization is potentially a lot heavier than the previous one,
 as we have distinct_primaries * distinct_secondaries distinct values.

 Worst case, where every document only contains distinct combinations of
 primary/secondary, we have 350M distinct option-values, which is quite
 heavy for a single box to facet on. Whether that is better or worse than
 350M documents, I don't know.

  Is denormalization the only way to handle this problem? I

 What you are trying to do does look quite a lot like hierarchical
 faceting, which Solr does not support directly. But even if you apply
 one of the experimental patches, it does not mitigate the potential
 combinatorial explosion of your primary  secondary values.

 So that leaves the question: How many distinct combinations of primary
 and secondary values do you have?

 Regards,
 Toke Eskildsen




-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Reloading ExternalFileField blocks Solr

2012-10-08 Thread Mikhail Khludnev
Martin,

Can you tell me what the content of that field is, and how it should affect
search results?

On Mon, Oct 8, 2012 at 12:55 PM, Martin Koch m...@issuu.com wrote:

 Hi List

 We're using Solr-4.0.0-Beta with a 7M document index running on a single
 host with 16 shards. We'd like to use an ExternalFileField to hold a value
 that changes often. However, we've discovered that the file is apparently
 re-read by every shard/core on *every commit*; the index is unresponsive in
 this period (around 20s on the host we're running on). This is unacceptable
 for our needs. In the future, we'd like to add other values as
 ExternalFileFields, and this will make the problem worse.

 It would be better if the external file were instead read in in the
 background, updating previously read relevant values for each shard as they
 are read in.

 I guess a change in the ExternalFileField code would be required to achieve
 this, but I have no experience here, so suggestions are very welcome.

 Thanks,
 /Martin Koch - Issuu - Senior Systems Architect.




-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Reloading ExternalFileField blocks Solr

2012-10-08 Thread Mikhail Khludnev
Martin,

I have a kind of hack approach in mind for hiding documents from search.
So, it's a little bit easier than your task. I'm going to deliver a talk
about it: http://www.apachecon.eu/schedule/presentation/89/ .
Frankly speaking, there is no reliable out-of-the-box solution for it. I
saw that DocValues have been integrated with FunctionQueries already, but
DocValues updates, which sound like a doable thing, have not been delivered
yet.

Regards

On Mon, Oct 8, 2012 at 11:54 PM, Martin Koch m...@issuu.com wrote:

 Sure: We're boosting search results based on user actions which could be
 e.g. the number of times a particular document has been read. In future,
 we'd also like to boost by e.g. impressions (the number of times a document
 has been displayed) and other values.

 /Martin

 On Mon, Oct 8, 2012 at 7:02 PM, Mikhail Khludnev 
 mkhlud...@griddynamics.com
  wrote:

  Martin,
 
  Can you tell me what's the content of that field, and how it should
 affect
  search result?
 
  On Mon, Oct 8, 2012 at 12:55 PM, Martin Koch m...@issuu.com wrote:
 
   Hi List
  
   We're using Solr-4.0.0-Beta with a 7M document index running on a
 single
   host with 16 shards. We'd like to use an ExternalFileField to hold a
  value
   that changes often. However, we've discovered that the file is
 apparently
   re-read by every shard/core on *every commit*; the index is
 unresponsive
  in
   this period (around 20s on the host we're running on). This is
  unacceptable
   for our needs. In the future, we'd like to add other values as
   ExternalFileFields, and this will make the problem worse.
  
   It would be better if the external file were instead read in in the
   background, updating previously read relevant values for each shard as
  they
   are read in.
  
   I guess a change in the ExternalFileField code would be required to
  achieve
   this, but I have no experience here, so suggestions are very welcome.
  
   Thanks,
   /Martin Koch - Issuu - Senior Systems Architect.
  
 
 
 
  --
  Sincerely yours
  Mikhail Khludnev
  Tech Lead
  Grid Dynamics
 
  http://www.griddynamics.com
   mkhlud...@griddynamics.com
 




-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Reloading ExternalFileField blocks Solr

2012-10-12 Thread Mikhail Khludnev
Martin,

I found a slide deck quite relevant to what you are asking about:

http://www.slideshare.net/lucenerevolution/potter-timothy-boosting-documents-in-solr


On Tue, Oct 9, 2012 at 7:57 AM, Otis Gospodnetic otis.gospodne...@gmail.com
 wrote:

 Hi Martin,

 Perhaps you could make a small change in Solr to add a 'don't reload the EFF
 if it hasn't been modified since it was last opened' rule.  I assume you
 commit pretty often, but don't modify EFF files that often, so this
 could save you some needless loading.  That said, I'd be surprised if EFF
 doesn't already do this... I didn't check.

 Otis
 --
 Search Analytics - http://sematext.com/search-analytics/index.html
 Performance Monitoring - http://sematext.com/spm/index.html


 On Mon, Oct 8, 2012 at 4:55 AM, Martin Koch m...@issuu.com wrote:
  Hi List
 
  We're using Solr-4.0.0-Beta with a 7M document index running on a single
  host with 16 shards. We'd like to use an ExternalFileField to hold a
 value
  that changes often. However, we've discovered that the file is apparently
  re-read by every shard/core on *every commit*; the index is unresponsive
 in
  this period (around 20s on the host we're running on). This is
 unacceptable
  for our needs. In the future, we'd like to add other values as
  ExternalFileFields, and this will make the problem worse.
 
  It would be better if the external file were instead read in in the
  background, updating previously read relevant values for each shard as
 they
  are read in.
 
  I guess a change in the ExternalFileField code would be required to
 achieve
  this, but I have no experience here, so suggestions are very welcome.
 
  Thanks,
  /Martin Koch - Issuu - Senior Systems Architect.




-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Understanding Filter Queries

2012-10-20 Thread Mikhail Khludnev
Amit,

Sure. This method
https://github.com/apache/lucene-solr/blob/trunk/solr/core/src/java/org/apache/solr/search/SolrIndexSearcher.java#L796
calculates, besides some other stuff, the fq docset intersection, which is
supplied into the filtered search call:
https://github.com/apache/lucene-solr/blob/trunk/solr/core/src/java/org/apache/solr/search/SolrIndexSearcher.java#L1474
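For intuition, here is a rough sketch of the leapfrog idea on two sorted
doc-id streams (plain Java, not the actual Solr classes):

  // advance whichever stream is behind until both land on the same doc id
  static long countIntersection(int[] a, int[] b) {
    long hits = 0;
    int i = 0, j = 0;
    while (i < a.length && j < b.length) {
      if (a[i] < b[j]) {
        i++;          // a is behind, leap forward
      } else if (a[i] > b[j]) {
        j++;          // b is behind, leap forward
      } else {
        hits++;       // both iterators agree: a match
        i++;
        j++;
      }
    }
    return hits;
  }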

You are welcome.

On Sun, Oct 21, 2012 at 12:00 AM, Amit Nithian anith...@gmail.com wrote:

 Hi all,

 Quick question. I've been reading up on the filter query and how it's
 implemented and the multiple articles I see keep referring to this
 notion of leap frogging and filter query execution in parallel with
 the main query. Question: Can someone point me to the code that does
 this so I can better understand?

 Thanks!
 Amit




-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Bitwise operation

2013-03-19 Thread Mikhail Khludnev
Christopher,

Would you mind if I ask you for a sample?
On 19.03.2013 19:31, Christopher ARZUR
christopher.ar...@cognix-systems.com wrote:

 Hi,

 Does solr (4.1.0) support a /bitwise/ AND or /bitwise/ OR operator so that
 we can specify a field to be compared against an index using /bitwise/ AND
 or OR?

 Thanks,
 --
 Christopher



Re: customize solr search/scoring for performance

2013-03-23 Thread Mikhail Khludnev
Robert,

I also wonder why it always requests to collect the doclist in order:
https://github.com/apache/lucene-solr/blob/trunk/solr/core/src/java/org/apache/solr/search/SolrIndexSearcher.java#L1469
Do you think it makes sense to raise a JIRA to allow out-of-order
collecting?



On Tue, Nov 13, 2012 at 6:34 AM, Robert Muir rcm...@gmail.com wrote:

 Whenever I look at solr users' stacktraces for disjunctions, I always
 notice they get BooleanScorer2.

 Is there some reason for this or is it not intentional (e.g. maybe a
 in-order collector is always being used when its possible at least in
 simple cases to allow for out-of-order hits?)

 When I examine test contributions from clover reports (e.g.
 https://builds.apache.org/job/Lucene-Solr-Clover-4.x/49/clover-report/),
 I notice that only lucene tests, and solr spellchecking tests actually
 hit BooleanScorer's collect. All other solr tests hit BooleanScorer2.

 If its possible to allow for an out of order collector in some common
 cases (e.g. large disjunctions w/ minShouldMatch generated by solr
 queryparsers), it could be a nice performance improvement.

 On Mon, Nov 12, 2012 at 3:48 PM, jchen2000 jchen...@yahoo.com wrote:
  The following was generated from jvisualvm. Seems like the perf is
 related to
  scoring a lot. Any idea/pointer on how to customize that part?
 
  http://lucene.472066.n3.nabble.com/file/n4019850/profilingResult.png
 
 
 
  --
  View this message in context:
 http://lucene.472066.n3.nabble.com/customize-solr-search-scoring-for-performance-tp4019444p4019850.html
  Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Very slow query when boosting involve with EnternalFileField

2013-03-25 Thread Mikhail Khludnev
Floyd,

I think you need to provide a stack trace or rough profiler sampling.


On Fri, Mar 22, 2013 at 6:23 AM, Floyd Wu floyd...@gmail.com wrote:

 Anybody can point me a direction?
 Many thanks.



 2013/3/20 Floyd Wu floyd...@gmail.com

  Hi everyone,
 
  I have a problem and have no luck to figure out.
 
  When I issue a query to
  Query 1
 
 
 http://localhost:8983/solr/select?q={!boost+b=recip(ms(NOW/HOUR,last_modified_datetime),3.16e-11,1,1)}all
 
 http://localhost:8983/solr/select?q=%7B!boost+b=recip(ms(NOW/HOUR,last_modified_datetime),3.16e-11,1,1)%7Dall
 
  :javastart=0rows=10fl=score,authorsort=score+desc
 
  Query 2
 
 
 http://localhost:8983/solr/select?q={!boost+b=sum(ranking,recip(ms(NOW/HOUR,last_modified_datetime)),3.16e-11,1,1)}all
 
 http://localhost:8983/solr/select?q=%7B!boost+b=sum(ranking,recip(ms(NOW/HOUR,last_modified_datetime)),3.16e-11,1,1)%7Dall
 
  :javastart=0rows=10fl=score,authorsort=score+desc
 
  The difference between two query is boost.
  The boost function of Query 2 using a field named ranking and this field
  is ExternalFileField.
  External file is key=value pair about 1 lines.
 
  Execution time
  Query 1--100ms
  Query 2--2300ms
 
  I tried to issue Query 3 and change ranking to a constant 1
 
 
 http://localhost:8983/solr/select?q={!boost+b=sum(1,recip(ms(NOW/HOUR,last_modified_datetime)),3.16e-11,1,1)}all
 
 http://localhost:8983/solr/select?q=%7B!boost+b=sum(1,recip(ms(NOW/HOUR,last_modified_datetime)),3.16e-11,1,1)%7Dall
 
  :javastart=0rows=10fl=score,authorsort=score+desc
 
  Execution time
  Query 3--110ms
 
  one thing I can sure that involved with externalFileField will slow down
  query execution time significantly. But I have no idea how to solve this
  problem as my boost function must calculate value of ranking field.
 
  Please help on this.
 
  PS: I'm using SOLR-4.1
 
  Floyd
 
 
 
 




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Is any way to return the number of indexed tokens in a field?

2013-04-14 Thread Mikhail Khludnev
Alex,

It's not clear what you need to count: pre-analyzed values, or tokens as an
analysis result.
If the former, I suggest you look into something like
FieldLengthUpdateProcessorFactory; in case of the latter you need to override
Similarity.computeNorm(String, FieldInvertState) / encode/decodeNorm.
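For the former case, a rough solrconfig.xml sketch (chain and field names are
made up; note this factory replaces a value with its character length, not a
token count - check the FieldMutatingUpdateProcessorFactory selector syntax in
the javadocs):

  <updateRequestProcessorChain name="field-length">
    <processor class="solr.FieldLengthUpdateProcessorFactory">
      <str name="fieldName">regex_matches_count</str>
    </processor>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>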



On Sun, Apr 14, 2013 at 8:29 AM, Alexandre Rafalovitch
arafa...@gmail.com wrote:

 Hello,

 We seem to have all sorts of functions around tokenized field content, but
 I am looking for simple count/length that can be returned as a
 pseudo-field. Does anyone know of one out of the box?

 The specific situation is that I am indexing a field for specific regular
 expressions that become tokens (in a copyField). Not every field has the
 same number of those.

 I now want to find the documents that have maximum number of tokens in that
 field (for testing and review). But I can't figure out how.  Any help would
 be appreciated.

 Regards,
Alex.
 Personal blog: http://blog.outerthoughts.com/
 LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
 - Time is the quality of nature that keeps events from happening all at
 once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Questions about the performance of Solr

2013-05-06 Thread Mikhail Khludnev
Hello,

start from http://wiki.apache.org/solr/CommonQueryParameters#fq




On Mon, May 6, 2013 at 11:42 AM, joo jamodr...@nate.com wrote:

 Search speed has dropped now that more than 70 million documents are loaded.
 A query takes about 50 seconds, and I cannot tell whether that is to be
 expected.
 I would like to know whether there is a problem with the query I use, and
 how to optimize it in Solr.
 The query I use is, for example:
 time: [time to time] AND category: (1,2) AND (message1: message OR
 message2: message)
 If the query itself is not the problem, please advise which part I should
 look at.



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Questions-about-the-performance-of-Solr-tp4060988.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: maximum number of simultaneous threads

2013-05-14 Thread Mikhail Khludnev
Venkata,

Solr is a neat webapp. It doesn't spin threads (almost); it runs in
servlet container threads. You need to configure tomcat/jetty.

On Tue, May 14, 2013 at 4:17 PM, Dmitry Kan solrexp...@gmail.com wrote:

 venkata,

 If you are after search scaling, then the webapp server (like tomcat, jetty
 etc) handles allocation of threads per client connection (maxThreads for
 jetty for instance). Inside one client request SOLR uses threads for
 various tasks, but I don't have any exact figures (not sure if wiki has
 them either).

 Dmitry


 On Mon, May 13, 2013 at 7:22 PM, venkata vmarr...@yahoo.com wrote:

 
 
 
 
 
  I am seeing  configuration point for indexing threads.
 
  However I am not finding anything for search.   How many simultaneous
  threads, SOLR can spin during search time?
 
 
 
  --
  View this message in context:
 
 http://lucene.472066.n3.nabble.com/maximum-number-of-simultaneous-threads-tp4062903p4062982.html
  Sent from the Solr - User mailing list archive at Nabble.com.
 




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Sorting facets by relevance

2013-05-15 Thread Mikhail Khludnev
There is the Lucene faceting module, which has nothing in common with
Solr faceting, but it looks like it has something like what you are looking for.
http://shaierera.blogspot.com/2012/11/lucene-facets-part-1.html


On Thu, May 16, 2013 at 1:33 AM, Jan Morlock jan.morl...@googlemail.com wrote:

 Hi,

 we are using faceted search for our queries. However neither sorting by
 count nor sorting by index as described in [1] is suitable for our business
 case. Instead, we would like to have the facets (or at least the beginning
 of them) sorted by the score of the top document possessing the
 corresponding facet. The expected behaviour can be compared to what the
 result grouping feature does (see [2]).

 I am currently thinking about the following strategy:
 (1) Create a new search component
 (2) Perform a sub-query using grouping
 (3) Use the result of this sub-query in order to sort the facets of the
 actual query.

 Currently step no. 2 seems to be pretty difficult. Can anybody point a me
 to
 an example, where a sub-query is performed in order to retrieve the groups?
 Or does anybody have a better/easier strategy for achieving this?

 Any help is appreciated.
 Thank you very much in advance.

 Best regards
 Jan

 [1]: http://wiki.apache.org/solr/SimpleFacetParameters#facet.sort
 [2]: http://wiki.apache.org/solr/FieldCollapsing



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Sorting-facets-by-relevance-tp4063649.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Facet pivot 50.000.000 different values

2013-05-17 Thread Mikhail Khludnev
On Fri, May 17, 2013 at 12:47 PM, Carlos Bonilla
carlosbonill...@gmail.com wrote:

 We
 only need to calculate how many different B values have more than 1
 document but it takes ages


Carlos,
It's not clear whether you need to take the results of a query into account or
just gather statistics from the index. If the latter, you can just enumerate
the terms and look at TermsEnum.docFreq(). Am I getting it right?
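A minimal Lucene 4.x-style sketch of the latter (field name "B" taken from
your mail; verify the iterator signature against your Lucene version):

  import org.apache.lucene.index.*;

  // count distinct values of field "B" that occur in more than one document
  static long countRepeatedValues(IndexReader reader) throws java.io.IOException {
    Terms terms = MultiFields.getTerms(reader, "B");
    if (terms == null) return 0;
    TermsEnum te = terms.iterator(null); // Lucene 4.x reuse-style signature
    long count = 0;
    while (te.next() != null) {
      if (te.docFreq() > 1) {
        count++;
      }
    }
    return count;
  }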


-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: intersection of filter queries with raw query parser

2013-06-02 Thread Mikhail Khludnev
Hello Sascha,

I propose calling the raw parser from the standard one via the nested query syntax:
http://searchhub.org/2009/03/31/nested-queries-in-solr/
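For your foo/v1/v2 example it could look like this (a sketch; mind URL-encoding):

fq=_query_:"{!raw f=foo}v1" OR _query_:"{!raw f=foo}v2"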

Regards.


On Fri, May 31, 2013 at 3:35 PM, Sascha Szott sz...@zib.de wrote:

 Hi folks,

 is it possible to use the raw query parser with a disjunctive filter
 query? Say, I have a field 'foo' and two values 'v1' and 'v2' (the field
 values are free text and can contain any character). What I want is to
 retrieve all documents satisying fq=foo:(v1 OR v2). In case only one field
 (v1) is given, the query fq={!raw f=foo}v1 works as expected. But how can I
 formulate the filter query (with the raw query parser) in case two values
 are provided.

 The same question was posted on Stackoverflow
 (http://stackoverflow.com/questions/5637675/solr-query-with-raw-data-and-union-multiple-facet-values)
 two years ago. But there was only the advice to give up using the raw query
 parser which is not what I want to do.

 Thanks in advance,
 Sascha




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Configuring lucene to suggest the indexed string for all the searches of the substring of the indexed string

2013-06-05 Thread Mikhail Khludnev
Please excuse my misunderstanding, but I always wonder why this index-time
processing is usually suggested. From my POV this is a case for query-time
processing, i.e. PrefixQuery, aka the wildcard query Jason* .
Ultra-fast term retrieval is also provided by TermsComponent.


On Wed, Jun 5, 2013 at 8:09 PM, Jack Krupansky j...@basetechnology.com wrote:

 ngrams?

 See:
 http://lucene.apache.org/core/4_3_0/analyzers-common/org/apache/lucene/analysis/ngram/NGramFilterFactory.html

 -- Jack Krupansky

 -Original Message- From: Prathik Puthran
 Sent: Wednesday, June 05, 2013 11:59 AM
 To: solr-user@lucene.apache.org
 Subject: Configuring lucene to suggest the indexed string for all the
 searches of the substring of the indexed string


 Hi,

 Is it possible to configure solr to suggest the indexed string for all the
 searches of the substring of the string?

 Thanks,
 Prathik




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Configuring lucene to suggest the indexed string for all the searches of the substring of the indexed string

2013-06-06 Thread Mikhail Khludnev
Got it. It's actually the opposite of the usual prefix suggestions.
Out-of-the-box it's provided by
http://wiki.apache.org/solr/TermsComponent via terms.regex= (also see the
last example there);
it works by loading terms into memory and linearly scanning them with the
regexp.
There is nothing more efficient out-of-the-box.
http://wiki.apache.org/solr/Suggester says support for infix-suggestions
_is planned_ for FSTLookup (which would be the only structure to support
these).
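For example (field name assumed):

http://localhost:8983/solr/terms?terms.fl=name&terms.regex=.*substring.*&terms.limit=25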


On Thu, Jun 6, 2013 at 10:25 AM, Prathik Puthran 
prathik.puthra...@gmail.com wrote:

 My use case is I want to search for any substring of the indexed string and
 the Suggester should suggest the indexed string. What can I do to make this
 work?

 Thanks,
 Prathik


 On Thu, Jun 6, 2013 at 2:05 AM, Mikhail Khludnev 
 mkhlud...@griddynamics.com
  wrote:

  Please excuse my misunderstanding, but I always wonder why this index
 time
  processing is suggested usually. from my POV is the case for query-time
  processing i.e. PrefixQuery aka wildcard query Jason* .
  Ultra-fast term retrieval also provided by TermsComponent.
 
 
  On Wed, Jun 5, 2013 at 8:09 PM, Jack Krupansky j...@basetechnology.com
  wrote:
 
   ngrams?
  
   See:
    http://lucene.apache.org/core/4_3_0/analyzers-common/org/apache/lucene/analysis/ngram/NGramFilterFactory.html
  
  
   -- Jack Krupansky
  
   -Original Message- From: Prathik Puthran
   Sent: Wednesday, June 05, 2013 11:59 AM
   To: solr-user@lucene.apache.org
   Subject: Configuring lucene to suggest the indexed string for all the
   searches of the substring of the indexed string
  
  
   Hi,
  
   Is it possible to configure solr to suggest the indexed string for all
  the
   searches of the substring of the string?
  
   Thanks,
   Prathik
  
 
 
 
  --
  Sincerely yours
  Mikhail Khludnev
  Principal Engineer,
  Grid Dynamics
 
  http://www.griddynamics.com
   mkhlud...@griddynamics.com
 




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: The 'threads' parameter in DIH - SOLR 4.3.0

2013-06-14 Thread Mikhail Khludnev
Hello,

Most times users end up coding a multithreaded SolrJ indexer, which I
consider a sad thing. As the 3.x fix contributor I want to share my vision
of the problem. While I did that work I realized that the join operation itself
is too hard, and even impossible, to make concurrent. I propose to add
concurrency to the outbound and inbound streams.

My plan is:
1. Add threads to the outbound flow:
https://issues.apache.org/jira/browse/SOLR-3585 - it allows DIH not to wait for
Solr. I mostly like that code, but recently I realized that it
implements the ConcurrentUpdateSolrServer algorithm; looking forward, I'd prefer
to unify some core concurrent code between them, or it's a kind of using CUSS
inside of DIH's SolrWriter.
2. The next problem which we've faced is SqlEntityProcessor. It has two
modes; one of them gets miserable performance due to the N+1 problem, and the
cached version is not production-capable with the default heap cache. Our
proposal for it is https://issues.apache.org/jira/browse/SOLR-4799 -
unfortunately I have no time to polish the patch.
3. After that, the only thing DIH waits for is JDBC. It can easily be
boosted by implementing a DataSource wrapper with a producer thread and a
bounded queue as a buffer (see the sketch below).
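Regarding item 3, a minimal sketch of such a buffer (generic Java; the class
name and buffer size are made up, this is not actual DIH API):

  import java.util.*;
  import java.util.concurrent.*;

  // Decouples JDBC fetching (producer thread) from DIH consumption.
  class BufferedRows implements Iterator<Map<String, Object>> {
    private static final Map<String, Object> EOF = new HashMap<String, Object>();
    private final BlockingQueue<Map<String, Object>> buffer =
        new ArrayBlockingQueue<Map<String, Object>>(1024); // bounded buffer
    private Map<String, Object> next;

    BufferedRows(final Iterator<Map<String, Object>> jdbcRows) {
      Thread producer = new Thread(new Runnable() {
        public void run() {
          try {
            while (jdbcRows.hasNext()) {
              buffer.put(jdbcRows.next()); // blocks when the buffer is full
            }
            buffer.put(EOF); // sentinel marks end of stream
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        }
      });
      producer.setDaemon(true);
      producer.start();
    }

    public boolean hasNext() {
      if (next == null) {
        try {
          next = buffer.take(); // blocks until the producer delivers a row
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          return false;
        }
      }
      return next != EOF;
    }

    public Map<String, Object> next() {
      if (!hasNext()) throw new NoSuchElementException();
      Map<String, Object> row = next;
      next = null;
      return row;
    }

    public void remove() { throw new UnsupportedOperationException(); }
  }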

If we complete this plan, we will never need to code SolrJ indexers.

The particular question to you is: what do you need to speed up?

On Thu, Jun 13, 2013 at 11:01 PM, Shawn Heisey s...@elyograg.org wrote:

 On 6/13/2013 12:08 PM, bbarani wrote:

 I see that the threads parameter has been removed from DIH from all
 version
 starting SOLR 4.x. Can someone let me know the best way to initiate
 indexing
 in multi threaded mode when using DIH now? Is there a way to do that?


 That parameter was removed because it didn't work right, and there was no
 apparent way to fix it.  The change that went into a later 3.6 version was
 a bandaid, not a fix.  I don't know all the details.

 There's no way to get multithreading with DIH directly, but you can do it
 indirectly:

 Create multiple request handlers with different names, such as
 /dataimport1, /dataimport2, etc.  Configure each handler with settings that
 will pull part of your data source.  Start them so they run concurrently.

 Depending on your environment, it may be easier to just write a
 multi-threaded indexing application using the Solr API for your language of
 choice.

 Thanks,
 Shawn




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics


 mkhlud...@griddynamics.com


Re: New operator.

2013-06-16 Thread Mikhail Khludnev
Hello Yanis,

Two options:
1. Create your own SearchComponent, which adds a filter query to the request, and add
it to the SearchHandler. http://wiki.apache.org/solr/SearchComponent
2. Create a QParserPlugin and call it via a request param:
...fq={!yanisqp}applyvector...
http://wiki.apache.org/solr/SolrPlugins#QParserPlugin
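For option 2, a bare-bones skeleton (Solr 4.x-style; the class name is made up -
register it in solrconfig.xml via <queryParser name="yanisqp" class="..."/>):

  import org.apache.lucene.search.Query;
  import org.apache.solr.common.params.SolrParams;
  import org.apache.solr.common.util.NamedList;
  import org.apache.solr.request.SolrQueryRequest;
  import org.apache.solr.search.QParser;
  import org.apache.solr.search.QParserPlugin;
  import org.apache.solr.search.SyntaxError;

  public class YanisQParserPlugin extends QParserPlugin {
    @Override
    public void init(NamedList args) {
    }

    @Override
    public QParser createParser(String qstr, SolrParams localParams,
                                SolrParams params, SolrQueryRequest req) {
      return new QParser(qstr, localParams, params, req) {
        @Override
        public Query parse() throws SyntaxError {
          // placeholder: call the proprietary engine here and translate its
          // answer vector into a Lucene Query (e.g. a filter-backed query
          // built from the engine's ids)
          throw new UnsupportedOperationException("integrate the engine here");
        }
      };
    }
  }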


On Sun, Jun 16, 2013 at 10:01 AM, Yanis Kakamaikis 
yanis.kakamai...@gmail.com wrote:

 Hi all,I want to add a new operator to my solr.   I need that operator
 to call my proprietary engine and build an answer vector to solr, in a way
 that this vector will be part of the boolean query at the next step.   How
 do I do that?
 Thanks




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Solr large boolean filter

2013-06-16 Thread Mikhail Khludnev
Right.
FieldCacheTermsFilter is an option. You need to create your own QParserPlugin
which yields a FieldCacheTermsFilter, and hook it in as ..fq={!idsqp
cache=false}..
Mind disabling caching! Mind term encoding due to the field type!

I also suggest checking how much time it spends on tokenization. Once I got
some gain by using an efficient encoding for this param (try fixed
length or vint).

There is one more gain when the core query is highly selective and the id
filter is weakly selective; in this case using explicit PostFiltering
(what a hack, btw) is desirable. See
http://yonik.com/posts/advanced-filter-caching-in-solr/

From my experience the proper solution for such problems is moving to one
of the joins or ExternalFileField.




On Sun, Jun 16, 2013 at 2:49 AM, Igor Kustov ivkus...@gmail.com wrote:

 I know i'm not the first one with this problem.

 I'm currently using solr 4.2.1 with approximately 10 mln documents in the
 index.

 The index is updated frequently.

 The filter_query is just a one big boolean or query by id.

 fq=id:(1 2 3 4 ... 50950)

 ids list is always different and not sequential.

 The problem is that query performance not so well, as you can imagine.

 In some particular cases i'm able to do filtering based on different
 fields,
 but in some cases (like 30-40% of all queries) i'm still end up with this
 large id filter.

 I'm looking for the ways to improve this query performance.

 It doesn't seem like solr join could be applied there.

 Another option that I found is to somehow use Lucene FieldCacheTermsFilter.
 Does it worth a try?

 Maybe i've missed some other options?





 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Solr-large-boolean-filter-tp4070747.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Solr large boolean filter

2013-06-17 Thread Mikhail Khludnev
Nonono, mate! I warned you before: 'Mind term encoding due to the field type!'

You need to obtain the schema from the request, then access the field type and
convert the external string representation into (possibly) tricky encoded
bytes via readableToIndexed(); see FieldType.getFieldQuery().

Btw, it's a really frequent pain on this list, feel free to contribute when
you're done!

An empty BooleanQuery matches nothing. There is a MatchAllDocsQuery().
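A minimal sketch of that conversion (Solr 4.x FieldType API; verify the exact
signature against your version):

  import org.apache.lucene.util.BytesRef;
  import org.apache.solr.schema.FieldType;

  FieldType ft = req.getSchema().getFieldType("id");
  BytesRef indexed = new BytesRef();
  ft.readableToIndexed("42", indexed); // external form -> indexed bytes
  // pass the indexed form, not the raw strings, to the terms filter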

On Mon, Jun 17, 2013 at 8:35 PM, Igor Kustov ivkus...@gmail.com wrote:


 Menawhile I'm currently trying to write custom QParser which will use
 FieldCacheTermsFilter

 So I'm using query like
 http://127.0.0.1:8080/solr/select?q=*:*fq={!mqparser}id:%281%202%203%29http://127.0.0.1:8080/solr/select?q=*:*fq=%7B!mqparser%7Did:%281%202%203%29

 And I couldn't make it work - I just couldn't find a proper constructor and
 also not sure that i'm filtering appropriately.

 private class MyQParser extends QParser {

   List<String> idsList;

   MyQParser(String queryString, SolrParams localParams, SolrParams solrParams,
             SolrQueryRequest solrQueryRequest) throws SyntaxError {
     super(queryString, localParams, solrParams, solrQueryRequest);
     idsList = // extract ids from params
   }

   @Override
   public Query parse() throws SyntaxError {
     FieldCacheTermsFilter filter =
         new FieldCacheTermsFilter(id, idsList.toArray());
     // first problem: id is just an int in my case, but this seems like the only
     // normal constructor
     return new FilteredQuery(new BooleanQuery(), filter);
     // my goal here is to get only filtered data, but does BooleanQuery()
     // equal *:*?
   }







 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Solr-large-boolean-filter-tp4070747p4071049.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Converting nested data model to solr schema

2013-07-01 Thread Mikhail Khludnev
On Mon, Jul 1, 2013 at 5:56 PM, adfel70 adfe...@gmail.com wrote:

 This requires me to override the solr document distribution mechanism.
 I fear that with this solution I may loose some of solr cloud's
 capabilities.


It's not clear whether you are aware of
http://searchhub.org/2013/06/13/solr-cloud-document-routing/ , but what you
did doesn't sound scary to me. If it works, it should be fine. I'm not
aware of any capabilities that you are going to lose.
Obviously SOLR-3076 provides astonishing query-time performance by
offloading the actual join work to index time. Check it if your current
approach turns slow.


-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Schema design for parent child field

2013-07-01 Thread Mikhail Khludnev
From my experience, deeply nested scopes are a case for SOLR-3076 almost exclusively.


On Sat, Jun 29, 2013 at 1:08 PM, Sperrink
kevin.sperr...@lexisnexis.co.za wrote:

 Good day,
 I'm seeking some guidance on how best to represent the following data
 within
 a solr schema.
 I have a list of subjects which are detailed to n levels.
 Each document can contain many of these subject entities.
 As I see it if this had been just 1 subject per document, dynamic fields
 would have been a good resolution.
 Any suggestions on how best to create this structure in a denormalised
 fashion while maintaining the data integrity.
 For example a document could have:
 Subject level 1: contract
 Subject level 2: claims
 Subject level 1: patent
 Subject level 2: counter claims

 If I were to search for level 1 contract, I would only want the facet count
 for level 2 to contain claims and not counter claims.

 Any assistance in this would be much appreciated.




 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Schema-design-for-parent-child-field-tp4074084.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Solr large boolean filter

2013-07-02 Thread Mikhail Khludnev
 Rafalovitch
   arafa...@gmail.com wrote:
   On Mon, Jun 17, 2013 at 12:35 PM, Igor Kustov ivkus...@gmail.com
  wrote:
   So I'm using query like
  
 
 http://127.0.0.1:8080/solr/select?q=*:*fq={!mqparser}id:%281%202%203%29
 http://127.0.0.1:8080/solr/select?q=*:*fq=%7B!mqparser%7Did:%281%202%203%29
 
  
   If the IDs are purely numeric, I wonder if the better way is to
 send
  a
   bitset. So, bit 1 is on if ID:1 is included, bit 2000 is on if
  ID:2000
   is included. Even using URL-encoding rules, you can fit at least 65
   sequential ID flags per character and I am sure there are more
   efficient encoding schemes for long empty sequences.
  
   Regards,
  Alex.
  
  
  
   Personal website: http://www.outerthoughts.com/
   LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
   - Time is the quality of nature that keeps events from happening
 all
   at once. Lately, it doesn't seem to be working.  (Anonymous  - via
  GTD
   book)
 
 
 




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: set-based and other less common approaches to search

2013-07-02 Thread Mikhail Khludnev
Try hitting the dismax query parser, specifying the mm and qf parameters.
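For the 'at least 100 of 300 terms' example from this thread, it could look
like this (field name assumed):

q=term1 term2 ... term300&defType=dismax&qf=content&mm=100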


On Tue, Jul 2, 2013 at 9:31 PM, gilawem mewa...@gmail.com wrote:

 Thanks. So following up on a) below, could I set up and query Solr,
 without any customization of code, to match 10 of my given 20 terms, but
 only if it finds those 10 terms in an xls document under a column that is
 named MyID or My ID or My I.D.? If so, what would that query look
 like?

 On Jul 2, 2013, at 12:38 PM, Otis Gospodnetic wrote:

  Hi,
 
  Solr can do all of these.  There are phrase queries, queries where you
  specify a field, the mm param for min should match, etc.
 
  Otis
  --
  Solr  ElasticSearch Support -- http://sematext.com/
  Performance Monitoring -- http://sematext.com/spm
 
 
 
  On Tue, Jul 2, 2013 at 12:36 PM, gilawem mewa...@gmail.com wrote:
  Let's say I wanted to ask solr to find me any document that contains at
 least 100 out of some 300 search terms I give it. Can Solr do this out of
 the box? If not, what kind of customization would it require?
 
  Now let's say I want to further have the option to request that those
 terms a) must show up within the same column of an excel spreadsheet, or b)
 are exact matches (i.e. match on search, but not searched), or c) occur
 in the exact order that I specified, or d) occur contiguously and without
 any words in between, or e) are made up of non-word elements such as
 92228345 or SJA12334.
 
  Can solr do any of these out of the box? If not, what of these tasks is
 relatively easy to do with some custom code, and what is not?




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Converting nested data model to solr schema

2013-07-02 Thread Mikhail Khludnev
During indexing the whole block (a doc and its attachments) goes into a
particular shard; then it can be queried on every shard and the results are
merged.

Btw, do you feel any problem with your current approach - query-time joins
and out-of-the-box shard routing?


On Tue, Jul 2, 2013 at 5:19 PM, adfel70 adfe...@gmail.com wrote:

 I'm not familiar with block join in lucene. I've read a bit, and I just
 want
 to make sure - do you think that when this ticket is released, it will
 solve
 the current problem of solr cloud joins?

 Also, can you elaborate a bit about your solution?


 Jack Krupansky-2 wrote
  It sounds like 4.4 will have an RC next week, so the prospects for block
  join in 4.4 are kind of dim. I mean, such a significant feature should
  have
  more than a few days to bake before getting released. But... who knows
  what
  Yonik has planned!
 
  -- Jack Krupansky
 
  -Original Message-
  From: adfel70
  Sent: Tuesday, July 02, 2013 7:41 AM
  To:

  solr-user@.apache

  Subject: Re: Converting nested data model to solr schema
 
  As you see it, does SOLR-3076 fixes my problem?
 
  Is SOLR-3076 fix getting into solr 4.4?
 
 
  Mikhail Khludnev wrote
  On Mon, Jul 1, 2013 at 5:56 PM, adfel70 lt;
 
  adfel70@
 
  gt; wrote:
 
  This requires me to override the solr document distribution mechanism.
  I fear that with this solution I may loose some of solr cloud's
  capabilities.
 
 
  It's not clear whether you aware of
  http://searchhub.org/2013/06/13/solr-cloud-document-routing/ , but what
  you
  did doesn't sound scary to me. If it works, it should be fine. I'm not
  aware of any capabilities that you are going to loose.
  Obviously SOLR-3076 provides astonishing query time performance, with
  offloading actual join work into index time. Check it if you current
  approach turns slow.
 
 
  --
  Sincerely yours
  Mikhail Khludnev
  Principal Engineer,
  Grid Dynamics
 
  lt;http://www.griddynamics.comgt;
   lt;
 
  mkhludnev@
 
  gt;
 
 
 
 
 
  --
  View this message in context:
 
 http://lucene.472066.n3.nabble.com/Converting-nested-data-model-to-solr-schema-tp4074351p4074668.html
  Sent from the Solr - User mailing list archive at Nabble.com.





 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Converting-nested-data-model-to-solr-schema-tp4074351p4074696.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Solr large boolean filter

2013-07-02 Thread Mikhail Khludnev
Roman,

It's covered in http://wiki.apache.org/solr/ContentStream
 | For POST requests where the content-type is not
application/x-www-form-urlencoded, the raw POST body is passed as a
stream.

So, there is no need to encode binary data inside the body.

Regarding encoding, I have had a positive experience passing such ids
encoded by vInt, but they need to be presorted.
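A minimal sketch of that encoding (delta + variable-length int, Lucene-style
vInt; presorted ascending ids assumed):

  import java.io.ByteArrayOutputStream;

  static byte[] encodeDeltaVInt(int[] sortedIds) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    int prev = 0;
    for (int id : sortedIds) {
      int delta = id - prev; // small numbers for dense, sorted ids
      prev = id;
      while ((delta & ~0x7F) != 0) {
        out.write((delta & 0x7F) | 0x80); // low 7 bits + continuation bit
        delta >>>= 7;
      }
      out.write(delta); // last byte, continuation bit clear
    }
    return out.toByteArray();
  }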



On Tue, Jul 2, 2013 at 10:46 PM, Roman Chyla roman.ch...@gmail.com wrote:

 Hello Mikhail,

 Yes, GET is limited, but POST is not - so I just wanted that it works in
 both the same way. But I am not sure if I am understanding your question
 completely. Could you elaborate on the parameters/body part? Is there no
 need for encoding of binary data inside the body? Or do you mean it is
 treated as a string? Or is it just a bytestream and other parameters are
 seen as string?

 On a general note: my main concern was to send many ids fast, if we use
 ints (32bit), in one MB, one can fit ~250K, with bitset 33 times more (sb
 check numbers please :)). But certainly, if the bitset is sparse or the
 collection of ids just a 'a few thousands', stream of ints/longs will be
 smaller, better to use.

 roman



 On Tue, Jul 2, 2013 at 2:00 PM, Mikhail Khludnev 
 mkhlud...@griddynamics.com
  wrote:

  Hello Roman,
 
  Don't you consider to pass long id sequence as body and access internally
  in solr as a content stream? It makes base64 compression not necessary.
  AFAIK url length is limited somehow, anyway.
 
 
  On Tue, Jul 2, 2013 at 9:32 PM, Roman Chyla roman.ch...@gmail.com
 wrote:
 
   Wrong link to the parser, should be:
  
  
 
 https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/search/BitSetQParserPlugin.java
  
  
   On Tue, Jul 2, 2013 at 1:25 PM, Roman Chyla roman.ch...@gmail.com
  wrote:
  
Hello @,
   
This thread 'kicked' me into finishing som long-past task of
sending/receiving large boolean (bitset) filter. We have been using
   bitsets
with solr before, but now I sat down and wrote it as a qparser. The
 use
cases, as you have discussed are:
   
 - necessity to send lng list of ids as a query (where it is not
possible to do it the 'normal' way)
 - or filtering ACLs
   
   
It works in the following way:
   
  - external application constructs bitset and sends it as a query to
   solr
(q or fq, depends on your needs)
  - solr unpacks the bitset (translated bits into lucene ids, if
necessary), and wraps this into a query which then has the easy job
 of
'filtering' wanted/unwanted items
   
Therefore it is good only if you can search against something that is
indexed as integer (id's often are).
   
A simple benchmark shows acceptable performance, to send the bitset
(randomly populated, 10M, with 4M bits set), it takes 110ms
 (25+64+20)
   
To decode this string (resulting byte size 1.5Mb!) it takes ~90ms
(5+14+68ms)
   
But I haven't tested latency of sending it over the network and the
  query
performance, but since the query is very similar as MatchAllDocs, it
 is
probably very fast (and I know that sending many Mbs to Solr is fast
 as
well)
   
I know this is not exactly 'standard' solution, and it is probably
 not
something you want to see with hundreds of millions of docs, but
 people
seem to be doing 'not the right thing' all the time;)
So if you think this is something useful for the community, please
 let
  me
know. If somebody would be willing to test it, i can file a JIRA
  ticket.
   
Thanks!
   
Roman
   
   
The code, if no JIRA is needed, can be found here:
   
   
  
 
 https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/search/AdsQParserPlugin.java
   
   
  
 
 https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/search/TestBitSetQParserPlugin.java
   
839ms.  run
154ms.  Building random bitset indexSize=1000 fill=0.5 --
Size=15054208,cardinality=3934477 highestBit=999
 25ms.  Converting bitset to byte array -- resulting array
  length=125
20ms.  Encoding byte array into base64 -- resulting array
  length=168
ratio=1.344
 62ms.  Compressing byte array with GZIP -- resulting array
length=1218602 ratio=0.9748816
20ms.  Encoding gzipped byte array into base64 -- resulting string
length=1624804 ratio=1.2998432
 5ms.  Decoding gzipped byte array from base64
14ms.  Uncompressing decoded byte array
68ms.  Converting from byte array to bitset
 743ms.  running
   
   
On Tue, Jun 18, 2013 at 3:51 PM, Erick Erickson 
  erickerick...@gmail.com
   wrote:
   
Not necessarily. If the auth tokens are available on some
other system (DB, LDAP, whatever), one could get them
in the PostFilter and cache them somewhere since,
presumably, they wouldn't be changing all that often. Or
use

Re: What are the options for obtaining IDF at interactive speeds?

2013-07-03 Thread Mikhail Khludnev
Katie,

This case is actually really hard to get. Just let me provide a
contra-sample, to let you explain the problem better by spotting the gap.
What if I say that debugQuery=true provides tf and idf for the terms and
documents from the requested page of results? Why can't you use explain to
solve the problem?
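For instance (a sketch): /select?q=your+query&fl=*,score&rows=10&debugQuery=true
returns an explain entry per returned document reporting the tf and idf of
every matched term.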


On Wed, Jul 3, 2013 at 1:06 AM, Kathryn Mazaitis
kathryn.riv...@gmail.com wrote:

 Hi,

 I'm using SOLRJ to run a query, with the goal of obtaining:

 (1) the retrieved documents,
 (2) the TF of each term in each document,
 (3) the IDF of each term in the set of retrieved documents (TF/IDF would be
 fine too)

 ...all at interactive speeds, or 10s per query. This is a demo, so if all
 else fails I can adjust the corpus, but I'd rather, y'know, actually do it.

 (1) and (2) are working; I completed the patch posted in the following
 issue:
 https://issues.apache.org/jira/browse/SOLR-949
 and am just setting tv=truetv.tf=true for my query. This way I get the
 documents and the tf information all in one go.

 With (3) I'm running into trouble. I have found 2 ways to do it so far:

 Option A: set tv.df=true or tv.tf_idf for my query, and get the idf
 information along with the documents and tf information. Since each term
 may appear in multiple documents, this means retrieving idf information for
 each term about 20 times, and takes over a minute to do.

 Option B: After I've gathered the tf information, run through the list of
 terms used across the set of retrieved documents, and for each term, run a
 query like:
 {!func}idf(text,'the_term')deftype=funcfl=scorerows=1
 ...while this retrieves idf information only once for each term, the added
 latency for doing that many queries piles up to almost two minutes on my
 current corpus.

 Is there anything I didn't think of -- a way to construct a query to get
 idf information for a set of terms all in one go, outside the bounds of
 what terms happen to be in a document?

 Failing that, does anyone have a sense for how far I'd have to scale down a
 corpus to approach interactive speeds, if I want this sort of data?

 Katie




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Performance of cross join vs block join

2013-07-11 Thread Mikhail Khludnev
Mihaela,

For me it's reasonable that a single-core join takes the same time as a
cross-core one. I just can't see what gain could be obtained in the former
case.
I'm hardly able to comment on the join code; I looked into it, and it's not
trivial, at least. With block join it doesn't need to obtain parentId term
values/numbers and look up parents by them. Both of these actions are
expensive. Also, block join works as an iterator, but join needs to allocate
memory for the parents bitset and populate it out of order, which impacts
scalability.
Also, in None scoring mode BJQ doesn't need to walk through all children, it
only hits the first one. Another nice feature is 'both-side leapfrog': if you
have a highly restrictive filter/query that intersects with BJQ, it allows
skipping many parents and children as well; that's not possible in Join,
which has a fairly 'full-scan' nature.
The main performance factor for Join is the number of child docs.
I'm not sure I got all your questions; please specify them in more detail
if something is still unclear.
Have you seen my benchmark
http://blog.griddynamics.com/2012/08/block-join-query-performs.html ?
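For reference, a minimal raw-Lucene 4.x block-join sketch (field names are
made up):

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.*;
  import org.apache.lucene.search.join.ScoreMode;
  import org.apache.lucene.search.join.ToParentBlockJoinQuery;

  // the parents filter identifies the parent doc of each indexed block
  Filter parents = new CachingWrapperFilter(
      new QueryWrapperFilter(new TermQuery(new Term("type", "parent"))));
  Query children = new TermQuery(new Term("color", "red"));
  // ScoreMode.None lets BJQ stop at the first matching child of a block
  Query q = new ToParentBlockJoinQuery(children, parents, ScoreMode.None);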



On Thu, Jul 11, 2013 at 1:52 PM, mihaela olteanu mihaela...@yahoo.com wrote:

 Hello,

 Does anyone know about some measurements in terms of performance for cross
 joins compared to joins inside a single index?

 Is it faster the join inside a single index that stores all documents of
 various types (from parent table or from children tables)with a
 discriminator field compared to the cross join (basically in this case each
 document type resides in its own index)?

 I have performed some tests but to me it seems that having a join in a
 single index (bigger index) does not add too much speed improvements
 compared to cross joins.

 Why a block join would be faster than a cross join if this is the case?
 What are the variables that count when trying to improve the query
 execution time?

 Thanks!
 Mihaela




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: How to make 'fq' optional?

2013-07-11 Thread Mikhail Khludnev
https://lucene.apache.org/solr/4_2_0/solr-core/org/apache/solr/search/SwitchQParserPlugin.html

Hoss cares about you!
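A sketch for your case (the name_fq param name is made up; per the
SwitchQParserPlugin javadocs the bare case matches a missing/blank input):

/select?q=*:*&first_name=peter
       &fq={!switch case='*:*' default=$name_fq v=$first_name}
       &name_fq={!field f=first_name v=$first_name}

When first_name is absent, the switch falls back to the match-all *:*, so
the filter becomes a no-op instead of an error.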


On Wed, Jul 10, 2013 at 10:40 PM, Learner bbar...@gmail.com wrote:

 I am trying to make a variable in fq optional,

 Ex:

 /select?first_name=peterfq=$first_nameq=*:*

 I don't want the above query to throw error or die whenever the variable
 first_name is not passed to the query instead return the value
 corresponding
 to rest of the query. I can use switch but its difficult to handle each and
 every case using switch (as I need to handle switch for so many
 variables)... Is there a way to resolve this via some other way?



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/How-to-make-fq-optional-tp4077042.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Performance of cross join vs block join

2013-07-12 Thread Mikhail Khludnev
On Fri, Jul 12, 2013 at 12:19 PM, mihaela olteanu mihaela...@yahoo.com wrote:

 Hi Mikhail,

 I have used wrong the term block join. When I said block join I was
 referring to a join performed on a single core versus cross join which was
 performed on multiple cores.
 But I saw your benchmark (from cache) and it seems that block join has
 better performance. Is this functionality available on Solr 4.3.1?

Nope, SOLR-3076 has been awaiting a commit for ages.


 I did not find such examples on Solr's wiki page.
 Does this functionality require a special schema, or a special indexing?

Special indexing - yes.


 How would I need to index the data from my tables? In my case anyway all
 the indices have a common schema since I am using dynamic fields, thus I
 can easily add all documents from all tables in one Solr core, but for each
 document to add a discriminator field.

Correct, but the notion of a 'discriminator field' is a little bit different
for block join.



 Could you point me to some more documentation?


I can recommend only these:
http://blog.mikemccandless.com/2012/01/searching-relational-content-with.html
http://www.youtube.com/watch?v=-OiIlIijWH0


 Thanks in advance,
 Mihaela


 
  From: Mikhail Khludnev mkhlud...@griddynamics.com
 To: solr-user solr-user@lucene.apache.org; mihaela olteanu 
 mihaela...@yahoo.com
 Sent: Thursday, July 11, 2013 2:25 PM
 Subject: Re: Performance of cross join vs block join


 Mihaela,

 For me it's reasonable that single core join takes the same time as cross
 core one. I just can't see which gain can be obtained from in the former
 case.
 I hardly able to comment join code, I looked into, it's not trivial, at
 least. With block join it doesn't need to obtain parentId term
 values/numbers and lookup parents by them. Both of these actions are
 expensive. Also blockjoin works as an iterator, but join need to allocate
 memory for parents bitset and populate it out of order that impacts
 scalability.
 Also in None scoring mode BJQ don't need to walk through all children, but
 only hits first. Also, nice feature is 'both side leapfrog' if you have a
 highly restrictive filter/query intersects with BJQ, it allows to skip many
 parents and children as well, that's not possible in Join, which has fairly
 'full-scan' nature.
 Main performance factor for Join is number of child docs.
 I'm not sure I got all your questions, please specify them in more details,
 if something is still unclear.
 have you saw my benchmark
 http://blog.griddynamics.com/2012/08/block-join-query-performs.html ?



 On Thu, Jul 11, 2013 at 1:52 PM, mihaela olteanu mihaela...@yahoo.com
 wrote:

  Hello,
 
  Does anyone know about some measurements in terms of performance for
 cross
  joins compared to joins inside a single index?
 
  Is it faster the join inside a single index that stores all documents of
  various types (from parent table or from children tables)with a
  discriminator field compared to the cross join (basically in this case
 each
  document type resides in its own index)?
 
  I have performed some tests but to me it seems that having a join in a
  single index (bigger index) does not add too much speed improvements
  compared to cross joins.
 
  Why a block join would be faster than a cross join if this is the case?
  What are the variables that count when trying to improve the query
  execution time?
 
  Thanks!
  Mihaela




 --
 Sincerely yours
 Mikhail Khludnev
 Principal Engineer,
 Grid Dynamics

 http://www.griddynamics.com
 mkhlud...@griddynamics.com




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

 http://www.griddynamics.com
mkhlud...@griddynamics.com


Re: Performance of cross join vs block join

2013-07-12 Thread Mikhail Khludnev
Hello Roman,

Thanks for your interest. I briefly looked at your approach, and I'm really
interested in your numbers.

Here is the trivial code; I'd rather rely on your testing framework,
and can provide you a version of Solr 4.2 with SOLR-3076 applied. Do you
need it?
https://github.com/m-khl/join-tester

What you are saying about benchmark representativeness definitely makes
sense. I didn't try to establish a completely representative
benchmark; I just wanted rough numbers related to my use case. I'm from
eCommerce, and that volume was enough for me.

What I didn't get is 'not the block joins, because these cannot be used for
citation data - we cannot reasonably index them into one segment'. Usually
there is no problem with blocks in a multi-segment index; a block definitely
can't span across segments. Anyway, please elaborate.
One of the block join benefits is the ability to hit only the first matched
child in a group and jump over the following ones. It isn't applicable in
general, but sometimes yields a huge gain.

On Fri, Jul 12, 2013 at 8:29 PM, Roman Chyla roman.ch...@gmail.com wrote:

 Hi Mikhail,
 I have commented on your blog, but it seems I have done st wrong, as the
 comment is not there. Would it be possible to share the test setup
 (script)?

 I have found out that the crucial thing with joins is the number of 'joins'
 [hits returned] and it seems that the experiments I have seen so far were
 geared towards small collection - even if Erick's index was 26M, the number
 of hits was probably small - you can see a very different story if you face
 some [other] real data. Here is a citation network and I was comparing
 lucene join's [ie not the block joins, because these cannot be used for
 citation data - we cannot reasonably index them into one segment])


 https://github.com/romanchyla/r-ranking-fun/blob/master/plots/raw/comparison-join-2nd.png

 Notice, the y axes is sqrt, so the running time for lucene join is growing
 and growing very fast! It takes lucene 30s to do the search that selects 1M
 hits.

 The comparison is against our own implementation of a similar search - but
 the main point I am making is that the join benchmarks should be showing
 the number of hits selected by the join operation. Otherwise, a very
 important detail is hidden.

 Best,

   roman


 On Fri, Jul 12, 2013 at 4:57 AM, Mikhail Khludnev 
 mkhlud...@griddynamics.com wrote:

  On Fri, Jul 12, 2013 at 12:19 PM, mihaela olteanu mihaela...@yahoo.com
  wrote:
 
   Hi Mikhail,
  
   I have used wrong the term block join. When I said block join I was
   referring to a join performed on a single core versus cross join which
  was
   performed on multiple cores.
   But I saw your benchmark (from cache) and it seems that block join has
   better performance. Is this functionality available on Solr 4.3.1?
 
  nope SOLR-3076 awaits for ages.
 
 
   I did not find such examples on Solr's wiki page.
   Does this functionality require a special schema, or a special
 indexing?
 
  Special indexing - yes.
 
 
   How would I need to index the data from my tables? In my case anyway
 all
   the indices have a common schema since I am using dynamic fields, thus
 I
   can easily add all documents from all tables in one Solr core, but for
  each
   document to add a discriminator field.
  
  correct. but notion of ' discriminator field' is a little bit different
 for
  blockjoin.
 
 
  
   Could you point me to some more documentation?
  
 
  I can recommend only those
 
 
 http://blog.mikemccandless.com/2012/01/searching-relational-content-with.html
  http://www.youtube.com/watch?v=-OiIlIijWH0
 
 
   Thanks in advance,
   Mihaela
  
  
   
From: Mikhail Khludnev mkhlud...@griddynamics.com
   To: solr-user solr-user@lucene.apache.org; mihaela olteanu 
   mihaela...@yahoo.com
   Sent: Thursday, July 11, 2013 2:25 PM
   Subject: Re: Performance of cross join vs block join
  
  
   Mihaela,
  
   For me it's reasonable that single core join takes the same time as
 cross
   core one. I just can't see which gain can be obtained from in the
 former
   case.
   I hardly able to comment join code, I looked into, it's not trivial, at
   least. With block join it doesn't need to obtain parentId term
   values/numbers and lookup parents by them. Both of these actions are
   expensive. Also blockjoin works as an iterator, but join need to
 allocate
   memory for parents bitset and populate it out of order that impacts
   scalability.
   Also in None scoring mode BJQ don't need to walk through all children,
  but
   only hits first. Also, nice feature is 'both side leapfrog' if you
 have a
   highly restrictive filter/query intersects with BJQ, it allows to skip
  many
   parents and children as well, that's not possible in Join, which has
  fairly
   'full-scan' nature.
   Main performance factor for Join is number of child docs.
   I'm not sure I got all your questions, please specify them in more

Re: Nested query in SOLR filter query (fq)

2013-07-15 Thread Mikhail Khludnev
Hello,

It sounds like a FieldCollapsing or Join scenario, but given only the
information you provided, it can be solved by indexing statuses as a
multivalued field:
 -ID-  -STATUS-
  id1   (1 2 3 4)
  id2   (1 2)
  id3   (1)

q=*:*&fq=STATUS:1&fq=NOT STATUS:3




On Mon, Jul 15, 2013 at 3:19 PM, EquilibriumCST valeri_ho...@abv.bg wrote:

 Hi all,

 I have the following case.

 Solr documents has fields -- id and status. Id is not unique. Unique is
 the
 combination of these two elements.
 Documents with same id have different statuses.

 List of Documents

  -ID-  -STATUS-
   id11
   id12
   id13
   id14
   id21
   id22
   id31

 I need to make query that takes all documents with specific status and to
 exclude documents that don't have other specific status.
 As an example I need to get all documents with status 2 and don't have
 status 3.
 The expected result should be document :
  id22

 Another example: all documents with status 1 and don't have status 3. Then
 the result should be:
  id21
  id31

 Here is my query that don't work

 http://192.168.130.14:13080/solr/select/?q=status:1version=2.2start=0rows=10indent=onfl=id,statusfq=-id:(*:*%20AND%20status:2)
 The problem is in filter query(fq) part. In fq must be the ids of the
 documents with status 2 and if the current document id is in this list to
 be
 excluded.
 I guess some subquery must be used in fq part or something else.
 Just for information we are using APACHE SOLR 3.6 and document count is
 around 100k.

 Thanks in advance!



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Nested-query-in-SOLR-filter-query-fq-tp4078020.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: short-circuit OR operator in lucene/solr

2013-07-22 Thread Mikhail Khludnev
Short answer: no, as stated it makes little sense.

But after some thinking, it could make some sense, potentially.
DisjunctionSumScorer holds the child scorers semi-ordered in a binary heap.
Hypothetically, such an inequality could be enforced on that heap, but a
heap might no longer work for that kind of alignment; hence, instead of a
heap, a TreeSet could be used for an experiment.
fwiw, it's a dev-list question.
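
For reference, a minimal sketch of how the query below maps onto Lucene's
BooleanQuery (Lucene 4.x API): both clauses are OPTIONAL (SHOULD), so there
is no evaluation order to short-circuit, and a document matching both
clauses gets both of them scored:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause.Occur;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class OrIsOptional {
      public static Query build() {
        // +(field1:value1 OR field2:value2)
        BooleanQuery or = new BooleanQuery();
        or.add(new TermQuery(new Term("field1", "value1")), Occur.SHOULD); // optional
        or.add(new TermQuery(new Term("field2", "value2")), Occur.SHOULD); // optional

        BooleanQuery top = new BooleanQuery();
        top.add(or, Occur.MUST); // the leading '+': the disjunction itself is required
        return top;
      }
    }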


On Mon, Jul 22, 2013 at 4:48 AM, Deepak Konidena deepakk...@gmail.comwrote:

 I understand that lucene's AND (&&), OR (||) and NOT (!) operators are
 shorthands for REQUIRED, OPTIONAL and EXCLUDE respectively, which is why
 one can't treat them as boolean operators (adhering to boolean algebra).

 I have been trying to construct a simple OR expression, as follows

 q = +(field1:value1 OR field2:value2)

 with a match on either field1 or field2. But since the OR is merely
 optional, for documents where both field1:value1 and field2:value2 match,
 the query returns a score reflecting a match on both clauses.

 How do I enforce short-circuiting in this context? In other words, how to
 implement short-circuiting as in boolean algebra where an expression A || B
 || C returns true if A is true without even looking into whether B or C
 could be true.
 -Deepak




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Appending *-wildcard suffix on all terms for querying: move logic from client to server side

2013-07-23 Thread Mikhail Khludnev
It can be done by extending LuceneQParser/SolrQueryParser, see
http://wiki.apache.org/solr/SolrPlugins#QParserPlugin
There is a newTermQuery(Term) method; it should be overridden to delegate
to the newPrefixQuery() method.
Overall, I suggest you consider using EdgeNGramTokenFilter at index time,
and then search with plain term queries.
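
As a rough illustration, a minimal parser sketch on the Lucene 4.x classic
QueryParser (wiring it into a custom QParserPlugin is left out; the class
name is illustrative):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.util.Version;

    public class SuffixWildcardQueryParser extends QueryParser {

      public SuffixWildcardQueryParser(String defaultField, Analyzer analyzer) {
        super(Version.LUCENE_44, defaultField, analyzer);
      }

      @Override
      protected Query newTermQuery(Term term) {
        // treat every bare term as if the user had typed term*
        return newPrefixQuery(term);
      }
    }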


On Tue, Jul 23, 2013 at 2:05 PM, Paul Blanchaert p...@amosis.eu wrote:

 My client has an installation with 3 different clients using the same Solr
 index. These clients all append a * wildcard suffix in the query: user
 enters abc def while search is performed against (abc* def*).
 In order to move away from this way of searching, we'd like to move the
 clients away from this wildcard search at the moment we implement a new
 index. However, at that time, the client apps will still need to use this
 wildcard suffix search. So the goal is to make appending the * suffix
 (when not already present) a configurable option on the server side.
 I thought a tokenizer would do the work, but as the wildcard searches are
 detected before analyzers do the work, this is not an option.
 Can I enable this without coding? Or should I use a (custom) functionquery
 or custom search handler?
 Any thought is appreciated.


 -
 Kind regards,

 Paul Blanchaert




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: How to make soft commit more reliable?

2013-07-24 Thread Mikhail Khludnev
Hello,

First of all, I don't think it can commit (even softly) every second;
afaik that's too frequent for a typical deployment. Hence, if you really
need such (N)RT, I suggest you experiment with it right now, to face the
bummer sooner.
Also, one-second durability sounds like an over-expectation for Solr; it
sounds like an OLTP requirement.
Then, note that Solr has a sort of pre-indexing record storage called
UpdateLog; try to experiment with syncLevel = FSYNC vs FLUSH.
That's how it works: when a document arrives for indexing, it's written
into the update log, which is a plain binary file. Indexing works as-is,
relying on the RAM buffer. When a node dies, the RAM buffer dies with it,
but the update log is persistent, and during startup Solr recovers
uncommitted updates from it. Caveat! UpdateLog has a HashMap internally
which easily hits OOM on rare commits.



On Wed, Jul 24, 2013 at 2:56 AM, SolrLover bbar...@gmail.com wrote:

 Currently I am using SOLR 3.5.X and I push updates to SOLR via queue
 (Active
 MQ) and perform hard commit every 30 minutes (since my index is relatively
 big around 30 million documents). I am thinking of using soft commit to
 implement NRT search but I am worried about the reliability.

 For ex: If I have the hard autocommit set to 10 minutes and a softcommit
 every second, new documents will show up every second, but in case of a
 JVM crash or power outage I will lose all the documents after the last
 hard commit.

 I was thinking of using a backup database or another SOLR index that I can
 use as a backup and write the document from queue in both places (one with
 soft commit, another index with just the push updates with normal hard
 commits (or) write simultaneously to a db and delete the rows once the hard
 commit is successful after making sure that we didn't lose any records).

 Does someone have any other idea to improve the reliability of the push
 updates when using soft commit?




 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/How-to-make-soft-commit-more-reliable-tp4079892.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Processing a lot of results in Solr

2013-07-24 Thread Mikhail Khludnev
Roman,

Can you disclose how that streaming writer works? What does it stream:
docList or docSet?

Thanks


On Wed, Jul 24, 2013 at 5:57 AM, Roman Chyla roman.ch...@gmail.com wrote:

 Hello Matt,

 You can consider writing a batch processing handler, which receives a query
 and instead of sending results back, it writes them into a file which is
 then available for streaming (it has its own UUID). I am dumping many GBs
 of data from solr in few minutes - your query + streaming writer can go
 very long way :)

 roman


 On Tue, Jul 23, 2013 at 5:04 PM, Matt Lieber mlie...@impetus.com wrote:

  Hello Solr users,
 
  Question regarding processing a lot of docs returned from a query; I
  potentially have millions of documents returned back from a query. What
 is
  the common design to deal with this ?
 
  2 ideas I have are:
   - create a client service that is multithreaded to handle this
  - Use the Solr pagination to retrieve a batch of rows at a time
 (start,
  rows in Solr Admin console )
 
  Any other ideas that I may be missing ?
 
  Thanks,
  Matt
 
 
  
 
 
 
 
 
 
 




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Processing a lot of results in Solr

2013-07-24 Thread Mikhail Khludnev
fwiw,
I did a prototype with the following differences:
- it streams straight to the socket output stream
- it streams on the go during collecting, with no need to store a bitset.
It might have some limited, extreme-case usage. Is anyone interested?


On Wed, Jul 24, 2013 at 7:19 PM, Roman Chyla roman.ch...@gmail.com wrote:

 On Tue, Jul 23, 2013 at 10:05 PM, Matt Lieber mlie...@impetus.com wrote:

  That sounds like a satisfactory solution for the time being -
  I am assuming you dump the data from Solr in a csv format?
 

 JSON


  How did you implement the streaming processor ? (what tool did you use
 for
  this? Not familiar with that)
 

 this is what dumps the docs:

 https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/response/JSONDumper.java

 it is called by one of our batch processors, which can pass it a bitset of
 recs

 https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/handler/batch/BatchProviderDumpIndex.java

 as far as streaming is concerned, we were all very nicely surprised: a
 few-GB file (on the local network) took a ridiculously short time - in
 fact, a colleague of mine assumed it was not working until we looked into
 the downloaded file ;-). You may want to look at line 463

 https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/handler/batch/BatchHandler.java

 roman


  You say it takes only a few minutes to dump the data - how long does it
  take to stream it back in, and is the performance acceptable (~ within
  minutes)?
 
  Thanks,
  Matt
 
  On 7/23/13 6:57 PM, Roman Chyla roman.ch...@gmail.com wrote:
 
  Hello Matt,
  
  You can consider writing a batch processing handler, which receives a
  query
  and instead of sending results back, it writes them into a file which is
  then available for streaming (it has its own UUID). I am dumping many
 GBs
  of data from solr in a few minutes - your query + streaming writer can go
  a very long way :)
  
  roman
  
  
  On Tue, Jul 23, 2013 at 5:04 PM, Matt Lieber mlie...@impetus.com
 wrote:
  
   Hello Solr users,
  
   Question regarding processing a lot of docs returned from a query; I
   potentially have millions of documents returned back from a query.
 What
  is
   the common design to deal with this ?
  
   2 ideas I have are:
    - create a client service that is multithreaded to handle this
   - Use the Solr pagination to retrieve a batch of rows at a time
  (start,
   rows in Solr Admin console )
  
   Any other ideas that I may be missing ?
  
   Thanks,
   Matt
  
  
   
  
  
  
  
  
  
  
 
 
  
 
 
 
 
 
 
 




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Synonym Phrase

2013-07-27 Thread Mikhail Khludnev
Hello,

As far as I know
http://nolanlawson.com/2012/10/31/better-synonym-handling-in-solr/ has some
usage in the industry.


On Fri, Jul 26, 2013 at 8:28 PM, Jack Krupansky j...@basetechnology.comwrote:

 Hmmm... Actually, I think there was also a solution where you could
 specify an alternate tokenizer for the synonym file which would not
 tokenize on space, so that the full phrase would be passed to the query
  parser/generator as a single term so that it would generate a phrase (if
  you have the autoGeneratePhraseQueries attribute of the field type set to
  true). But, I don't recall the details... and it's not the default, which
  maybe it should be.


 -- Jack Krupansky

 -Original Message- From: Furkan KAMACI
 Sent: Friday, July 26, 2013 12:18 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Synonym Phrase


Why does Solr not split those terms by ';'? I thought that it splits both
by ';' and by the whitespace character?

 2013/7/26 Jack Krupansky j...@basetechnology.com

  Well, that's one of the areas where Solr synonym support breaks down. The
 LucidWorks Search query parser has a proprietary solution for that
 problem,
 but it won't help you with bare Solr. Some people have used shingles.

 In short, for query-time synonym phrases your best bet is to parse the
 query at the application level and generate a Solr query that has the
 synonyms pre-expanded.

 Application preprocessing could be as simple as scanning for the synonym
 phrases and then adding OR terms for the synonym phrases.

 -- Jack Krupansky

 -Original Message- From: Furkan KAMACI
 Sent: Friday, July 26, 2013 10:53 AM
 To: solr-user@lucene.apache.org
 Subject: Synonym Phrase


 I have a synonyms file like this:

 cart; shopping cart; market trolley

 When I analyse my query I see that when I search for cart these become
 synonyms:


 cart, shopping, market, trolley

 so cart is a synonym of shopping. How should I define my synonyms.txt file
 so that it understands that cart is a synonym of shopping cart?





-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: paging vs streaming. spawn from (Processing a lot of results in Solr)

2013-07-27 Thread Mikhail Khludnev
Otis,
You gave links to 'deep paging' when I asked about response streaming.
Let me understand. From my POV, deep paging is a special case of regular
search scenarios; we definitely need it in Solr. However, if we are talking
about data-analytics-like problems, where we need to select an endless
stream of responses (or store them in a file, as Roman did), 'deep paging'
is a suboptimal hack.
What's your vision on this?


Re: paging vs streaming. spawn from (Processing a lot of results in Solr)

2013-07-27 Thread Mikhail Khludnev
Roman,

Let me briefly explain the design.

A special RequestParser stores the servlet output stream into the context:
https://github.com/m-khl/solr-patches/compare/streaming#L7R22

Then a special component injects a special PostFilter/DelegatingCollector
which writes right into the output:
https://github.com/m-khl/solr-patches/compare/streaming#L2R146

Here is how it streams the doc; you can see it's lazy enough:
https://github.com/m-khl/solr-patches/compare/streaming#L2R181

Note that it disables the later collectors:
https://github.com/m-khl/solr-patches/compare/streaming#L2R57
hence no facets with streaming yet, but no extra memory consumption either.

This test shows how it works:
https://github.com/m-khl/solr-patches/compare/streaming#L15R115

All the other code is purposed for distributed search.
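
In spirit, the collector looks something like this (a heavily simplified
sketch in Solr 4.x terms, not the actual patch; names are illustrative):

    import java.io.IOException;
    import java.io.Writer;
    import org.apache.lucene.index.AtomicReaderContext;
    import org.apache.lucene.search.Scorer;
    import org.apache.solr.search.DelegatingCollector;

    public class StreamingCollector extends DelegatingCollector {
      private final Writer out; // the servlet output stream taken from the context
      private int docBase;

      public StreamingCollector(Writer out) {
        this.out = out;
      }

      @Override
      public void setScorer(Scorer scorer) throws IOException {
        // scores are not streamed; deliberately not delegating
      }

      @Override
      public void setNextReader(AtomicReaderContext context) throws IOException {
        docBase = context.docBase; // not delegating: later collectors stay disabled
      }

      @Override
      public void collect(int doc) throws IOException {
        // write each hit out as it is collected; nothing is buffered per hit
        out.write(Integer.toString(docBase + doc));
        out.write('\n');
      }

      @Override
      public boolean acceptsDocsOutOfOrder() {
        return false;
      }
    }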



On Sat, Jul 27, 2013 at 4:44 PM, Roman Chyla roman.ch...@gmail.com wrote:

 Mikhail,
 If your solution gives lazy loading of solr docs /and thus streaming of
 huge result lists/ it should be big YES!
 Roman
 On 27 Jul 2013 07:55, Mikhail Khludnev mkhlud...@griddynamics.com
 wrote:

  Otis,
  You gave links to 'deep paging' when I asked about response streaming.
  Let me understand. From my POV, deep paging is a special case for regular
  search scenarios. We definitely need it in Solr. However, if we are
 talking
  about data analytic like problems, when we need to select an endless
  stream of responses (or store them in file as Roman did), 'deep paging'
 is
  a suboptimal hack.
  What's your vision on this?
 




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: paging vs streaming. spawn from (Processing a lot of results in Solr)

2013-07-27 Thread Mikhail Khludnev
Hello,

Please find below


 Let me just explain better what I found when I dug inside solr: documents
 (results of the query) are loaded before they are passed into a writer - so
 the writers are expecting to encounter the solr documents, but these
 documents were loaded by one of the components before rendering them - so
 it is kinda 'hard-coded'.

Here is the code
https://github.com/apache/lucene-solr/blob/trunk/solr/core/src/java/org/apache/solr/handler/component/QueryComponent.java#L445
which pulls documents into the document cache.
To achieve your goal you can try to remove the document cache, or disable
lazy field loading.


 But if solr was NOT loading these docs before
 passing them to a writer, writer can load them instead (hence lazy loading,
 but the difference is in numbers - it could deal with hundreds of thousands
 of docs, instead of few thousands now).


Anyway, even if the writer pulls docs one by one, that doesn't allow
streaming a billion of them. Solr writes out a DocList, which is really
problematic even in deep-paging scenarios.




 roman


 On Sat, Jul 27, 2013 at 3:52 PM, Mikhail Khludnev 
 mkhlud...@griddynamics.com wrote:

  Roman,
 
  Let me briefly explain  the design
 
  special RequestParser stores servlet output stream into the context
  https://github.com/m-khl/solr-patches/compare/streaming#L7R22
 
  then special component injects special PostFilter/DelegatingCollector
 which
  writes right into output
  https://github.com/m-khl/solr-patches/compare/streaming#L2R146
 
  here is how it streams the doc, you see it's lazy enough
  https://github.com/m-khl/solr-patches/compare/streaming#L2R181
 
  I mention that it disables later collectors
  https://github.com/m-khl/solr-patches/compare/streaming#L2R57
  hence, no facets with streaming, yet as well as memory consumption.
 
  This test shows how it works
  https://github.com/m-khl/solr-patches/compare/streaming#L15R115
 
  all other code purposed for distributed search.
 
 
 
  On Sat, Jul 27, 2013 at 4:44 PM, Roman Chyla roman.ch...@gmail.com
  wrote:
 
   Mikhail,
   If your solution gives lazy loading of solr docs /and thus streaming of
   huge result lists/ it should be big YES!
   Roman
   On 27 Jul 2013 07:55, Mikhail Khludnev mkhlud...@griddynamics.com
   wrote:
  
Otis,
You gave links to 'deep paging' when I asked about response
 streaming.
Let me understand. From my POV, deep paging is a special case for
  regular
search scenarios. We definitely need it in Solr. However, if we are
   talking
about data analytic like problems, when we need to select an
 endless
stream of responses (or store them in file as Roman did), 'deep
 paging'
   is
a suboptimal hack.
What's your vision on this?
   
  
 
 
 
  --
  Sincerely yours
  Mikhail Khludnev
  Principal Engineer,
  Grid Dynamics
 
  http://www.griddynamics.com
   mkhlud...@griddynamics.com
 




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: paging vs streaming. spawn from (Processing a lot of results in Solr)

2013-07-27 Thread Mikhail Khludnev
On Sun, Jul 28, 2013 at 1:25 AM, Yonik Seeley yo...@lucidworks.com wrote:


 Which part is problematic... the creation of the DocList (the search),

Literally, DocList is a copy of TopDocs. Creating TopDocs is not search but
ranking, and ranking costs log(rows+start) per collected hit, on top of the
numFound work that the search itself takes. E.g. with start=1,000,000 and
rows=10, every competitive hit pays roughly log2(1,000,010), i.e. about 20
comparisons in the heap.
Interestingly, we still pay that log() even when asking to collect docs
as-is with _docid_.


 or it's memory requirements (an int per doc)?

TopXxxCollector, as well as the XxxComparators, allocates arrays of the
same [rows+start] size.

It's clear that once we have deep paging, we only need to handle heaps of
size rows (without start).
That's fairly OK if we use Solr as a site-navigation engine, but it's
'sub-optimal' for data-analytics use cases, where we need something like
SELECT * FROM ... in an RDBMS. In that case any such memory allocation on a
billion-doc index is a bummer. That's why I'm asking about removing the
heap-based collector/comparator.


 -Yonik
 http://lucidworks.com




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: DIH to index the data - 250 millions - Need a best architecture

2013-07-29 Thread Mikhail Khludnev
Mishra,
What if you set up DIH with a single SqlEntityProcessor without caching;
does it work for you?


On Mon, Jul 29, 2013 at 4:00 PM, Santanu8939967892 mishra.sant...@gmail.com
 wrote:

 Hi,
 I have a huge volume of DB records, close to 250 million.
 I am going to use DIH to index the data into Solr.
 I need the best architecture to index and query the data in an efficient
 manner.
 I am using Windows Server 2008 with 16 GB RAM, a Xeon processor and Solr
 4.4.


 With Regards,
 Santanu




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Pentaho Kettle vs DIH

2013-07-29 Thread Mikhail Khludnev
Hello,

Does anyone have experience with using Pentaho Kettle for processing RDBMS
records and pouring them into Solr? Isn't it some sort of replacement for
the DIH?

-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: How might one search for dupe IDs other than faceting on the ID field?

2013-07-30 Thread Mikhail Khludnev
Dotan,

Could you please provide more lines of the stack trace?
I have no idea why it got worse in 4.3. I know that 4.3 can use facets
backed by DocValues, which are modest on the heap, but from what I saw
(I can be wrong) that's disabled for numeric facets. Hence, I can suggest
reindexing id as string docvalues and hoping for them; however, reindexing
everything without strong guarantees is doubtful.
Also, I checked the source code of
http://wiki.apache.org/solr/TermsComponent and found that it can be really
memory-modest (i.e. without sort or limit).
Be aware that the df-s returned by that component are unaware of deleted
documents, hence run expungeDeletes before.
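
For instance, assuming a /terms handler is configured, something along
these lines should enumerate duplicated ids without building facet arrays
(host and path are illustrative):

http://localhost:8983/solr/terms?terms.fl=id&terms.mincount=2&terms.sort=index&terms.limit=-1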


On Tue, Jul 30, 2013 at 10:16 PM, Dotan Cohen dotanco...@gmail.com wrote:

 To search for duplicate IDs, I am running the following query:
  select?q=*:*&facet=true&facet.field=id&rows=0

 However, since upgrading from Solr 4.1 to Solr 4.3 I am receiving
 OutOfMemoryError errors instead of the desired facet:

 <response><lst name="error"><str
 name="msg">java.lang.OutOfMemoryError: Java heap space</str><str
 name="trace">java.lang.RuntimeException: java.lang.OutOfMemoryError:
 Java heap space
 at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:670)
 at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:380)
 at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
 at ...

 Might there be a less resource-intensive way to get this information?
 This is Solr 4.3 running on Ubuntu Server 12.04 in Jetty. The index
 has over 100,000,000 small records, for a total of about 95 GiB of
 disk space, with Solr running on its own disk. Actually, the 'disk'
 is an Amazon Web Service EBS volume.

 --
 Dotan Cohen

 http://gibberish.co.il
 http://what-is-what.com




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com

