Re: solr broke a pipe

2012-05-02 Thread Mikhail Khludnev
It seems like the slave instance started to pull the index from the master and
then died, which causes the broken pipe at the master node.

On Thu, May 3, 2012 at 3:31 AM, Robert Petersen  wrote:

> Anyone have any clues about this exception?  It happened during the
> course of normal indexing.  This is new to me (we're running solr 3.6 on
> tomcat 6/redhat RHEL) and we've been running smoothly for some time now
> until this showed up:
>
> >>>Red Hat Enterprise Linux Server release 5.3 (Tikanga)
>
> >>>
>
> >>>Apache Tomcat Version 6.0.20
>
> >>>
>
> >>>java.runtime.version = 1.6.0_25-b06
>
> >>>
>
> >>>java.vm.name = Java HotSpot(TM) 64-Bit Server VM
>
>
>
> May 2, 2012 4:07:48 PM
> org.apache.solr.handler.ReplicationHandler$FileStream write
>
> WARNING: Exception while writing response for params:
> indexversion=1276893500358&file=_1uca.frq&command=filecontent&checksum=true&wt=filestream
>
> ClientAbortException:  java.net.SocketException: Broken pipe
>
>at org.apache.catalina.connector.OutputBuffer.realWriteBytes(OutputBuffer.java:358)
>at org.apache.tomcat.util.buf.ByteChunk.append(ByteChunk.java:354)
>at org.apache.catalina.connector.OutputBuffer.writeBytes(OutputBuffer.java:381)
>at org.apache.catalina.connector.OutputBuffer.write(OutputBuffer.java:370)
>at org.apache.catalina.connector.CoyoteOutputStream.write(CoyoteOutputStream.java:89)
>at org.apache.solr.common.util.FastOutputStream.write(FastOutputStream.java:87)
>at org.apache.solr.handler.ReplicationHandler$FileStream.write(ReplicationHandler.java:1076)
>at org.apache.solr.handler.ReplicationHandler$3.write(ReplicationHandler.java:936)
>at org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:345)
>at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:273)
>at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
>at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
>at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
>at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
>at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
>at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
>at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
>at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
>at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:849)
>at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
>at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:454)
>at java.lang.Thread.run(Unknown Source)
>
> Caused by: java.net.SocketException: Broken pipe
>
>at java.net.SocketOutputStream.socketWrite0(Native Method)
>at java.net.SocketOutputStream.socketWrite(Unknown Source)
>at java.net.SocketOutputStream.write(Unknown Source)
>at org.apache.coyote.http11.InternalOutputBuffer.realWriteBytes(InternalOutputBuffer.java:740)
>at org.apache.tomcat.util.buf.ByteChunk.flushBuffer(ByteChunk.java:434)
>at org.apache.tomcat.util.buf.ByteChunk.append(ByteChunk.java:349)
>at org.apache.coyote.http11.InternalOutputBuffer$OutputStreamOutputBuffer.doWrite(InternalOutputBuffer.java:764)
>at org.apache.coyote.http11.filters.ChunkedOutputFilter.doWrite(ChunkedOutputFilter.java:126)
>at org.apache.coyote.http11.InternalOutputBuffer.doWrite(InternalOutputBuffer.java:573)
>at org.apache.coyote.Response.doWrite(Response.java:560)
>at org.apache.catalina.connector.OutputBuffer.realWriteBytes(OutputBuffer.java:353)
>
>... 21 more
>
>


-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics


 


Parent-Child relationship

2012-05-02 Thread tamanjit.bin...@yahoo.co.in
Hi,
I just wanted to get some information about whether the Parent-Child
relationship between documents that Lucene has been talking about has been
implemented in Solr or not. I know the join patch is available; would that be
the only solution?

And another question: as and when this becomes possible (if it's not done
already), would such functionality (whether a join or defining such
relations at index time) be available across different cores?

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Parent-Child-relationship-tp3958259.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Grouping ngroups count

2012-05-02 Thread Martijn v Groningen
Hi Francois,

The issue you describe looks similar to an issue we have fixed before
with the matches count.
Open an issue and we can look into it.

Martijn

On 1 May 2012 20:14, Francois Perron
 wrote:
> Thanks for your response Cody,
>
>  First, I used distributed grouping on 2 shards, and I'm sure that all 
> documents of each group are in the same shard.
>
> I took a look at the JIRA issue and it seems really similar.  There is the same 
> problem with group.ngroups.  The count is calculated in the second pass, so we 
> only get results from the "useful" shards, and that's why when I increase the rows limit 
> I get the right count (it must use all my shards).
>
> Unless it's a feature (I hope not), I will create a new JIRA issue for this.
>
> Thanks
>
> On 2012-05-01, at 12:32 PM, Young, Cody wrote:
>
>> Hello,
>>
>> When you say 2 slices, do you mean 2 shards? As in, you're doing a 
>> distributed query?
>>
>> If you're doing a distributed query, then for group.ngroups to work you need 
>> to ensure that all documents for a group exist on a single shard.
>>
>> However, what you're describing sounds an awful lot like this JIRA issue 
>> that I entered a while ago for distributed grouping. I found that the hit 
>> count was coming only from the shards that ended up having results in the 
>> documents that were returned. I didn't test group.ngroups at the time.
>>
>> https://issues.apache.org/jira/browse/SOLR-3316
>>
>> If this is a similar issue then you should make a new Jira issue.
>>
>> Cody
>>
>> -Original Message-
>> From: Francois Perron [mailto:francois.per...@wantedanalytics.com]
>> Sent: Tuesday, May 01, 2012 6:47 AM
>> To: solr-user@lucene.apache.org
>> Subject: Grouping ngroups count
>>
>> Hello all,
>>
>>  I tried to use grouping with 2 slices on an index of 35K documents.  When 
>> I ask for the top 10 rows, grouped by field A, it gives me about 16K groups.  But if 
>> I ask for the top 20K rows, the ngroups property is now at 30K.
>>
>> Do you know why and of course how to fix it ?
>>
>> Thanks.
>



-- 
Met vriendelijke groet,

Martijn van Groningen


Re: Lucene FieldCache - Out of memory exception

2012-05-02 Thread Rahul R
Jack,
Yes, the queries work fine till I hit the OOM. The fields that start with
S_* are strings, F_* are floats, I_* are ints, and so on. The dynamic field
definitions from schema.xml :

   <dynamicField name="S_*" type="string" indexed="true" stored="true"/>
   <dynamicField name="F_*" type="float" indexed="true" stored="true"/>
   <dynamicField name="I_*" type="int" indexed="true" stored="true"/>

*Each FieldCache will be an array with maxdoc entries (your total number of
documents - 1.4 million) times the size of the field value or whatever a
string reference is in your JVM*
So if I understand correctly - every field (dynamic or normal) will have its
own field cache. The size of the field cache for any field will be (maxDocs
* sizeOfField)? If the field has only 100 unique values, will it occupy
(100 * sizeOfField) or will it still be (maxDocs * sizeOfField)?
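As a rough back-of-the-envelope check with the numbers from this thread (1.4
million docs, ~300 unique terms of ~20 chars per faceted field, ~440
accumulated facet fields), and assuming a 4-byte per-document entry plus a
separate unique-term table:

public class FieldCacheEstimate {
    public static void main(String[] args) {
        long maxDoc = 1400000L;       // total documents in the index
        long refBytes = 4L;           // assumed per-document entry (ord/reference)
        long uniqueTerms = 300L;      // average unique terms per faceted field
        long avgTermChars = 20L;      // average term length
        long termTableBytes = uniqueTerms * avgTermChars * 2L; // 2 bytes per char

        long perField = maxDoc * refBytes + termTableBytes;
        long fieldCount = 440L;       // accumulated faceted fields
        System.out.println("per field  ~" + (perField / (1024 * 1024)) + " MB");
        System.out.println("all fields ~" + (perField * fieldCount / (1024 * 1024)) + " MB");
    }
}

Under those assumptions each string field costs ~5 MB (dominated by the
per-document array, not the 100 or 300 unique values), and 440 of them come
to roughly 2.3 GB, which would explain a 3 GB heap running out.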

*Roughly what is the typical or average length of one of your facet field
values? And, on average, how many unique terms are there within a typical
faceted field?*
Each field value may vary from 10 - 30 characters, with an average of 20 maybe.
The number of unique terms within a faceted field will vary from 100 - 1000, with
an average of 300. How will the number of unique terms affect performance ?

*3 GB sounds like it might not be enough for such heavy use of faceting. It
is probably not the 50-70 number, but the 440 or accumulated number across
many queries that pushes the memory usage up*
I am using jdk1.5.0_14 - 32 bit. With a 32-bit JDK, I think there is a
limitation that more RAM cannot be allocated.

*When you hit OOM, what does the Solr admin stats display say for
FieldCache?*
I don't have solr deployed as a separate web app. All solr jar files are
present in my webapp's WEB-INF\lib directory. I use EmbeddedSolrServer. So
is there a way I can get the information that the admin would show ?

Thank you for your time.

-Rahul


On Wed, May 2, 2012 at 5:19 PM, Jack Krupansky wrote:

> The FieldCache gets populated the first time a given field is referenced
> as a facet and then will stay around forever. So, as additional queries get
> executed with different facet fields, the number of FieldCache entries will
> grow.
>
> If I understand what you have said, these faceted queries do work
> initially, but after a while they stop working with OOM, correct?
>
> The size of a single FieldCache depends on the field type. Since you are
> using dynamic fields, it depends on your "dynamicField" types - which you
> have not told us about. From your query I see that your fields start with
> "S_" and "F_" - presumably you have dynamic field types "S_*" and "F_*"?
> Are they strings, integers, floats, or what?
>
> Each FieldCache will be an array with maxdoc entries (your total number of
> documents - 1.4 million) times the size of the field value or whatever a
> string reference is in your JVM.
>
> String fields will take more space than numeric fields for the FieldCache,
> since a separate table is maintained for the unique terms in that field.
> Roughly what is the typical or average length of one of your facet field
> values? And, on average, how many unique terms are there within a typical
> faceted field?
>
> If you can convert many of these faceted fields to simple integers the
> size should go down dramatically, but that depends on your application.
>
> 3 GB sounds like it might not be enough for such heavy use of faceting. It
> is probably not the 50-70 number, but the 440 or accumulated number across
> many queries that pushes the memory usage up.
>
> When you hit OOM, what does the Solr admin stats display say for
> FieldCache?
>
> -- Jack Krupansky
>
> -Original Message- From: Rahul R
> Sent: Wednesday, May 02, 2012 2:22 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Lucene FieldCache - Out of memory exception
>
>
> Here is one sample query that I picked up from the log file :
>
> q=*%3A*&fq=Category%3A%223__107%22&fq=S_P1540477699%3A%22MICROCIRCUIT%2C+LINE+TRANSCEIVERS%22&rows=0&facet=true&facet.mincount=1&facet.limit=2&facet.field=S_C1503120369&facet.field=S_P1406389942&facet.field=S_P1430116878&facet.field=S_P1430116881&facet.field=S_P1406453552&facet.field=S_P1406451296&facet.field=S_P1406452465&facet.field=S_C2968809156&facet.field=S_P1406389980&facet.field=S_P1540477699&facet.field=S_P1406389982&facet.field=S_P1406389984&facet.field=S_P1406451284&facet.field=S_P1406389926&facet.field=S_P1424886581&facet.field=S_P2017662632&facet.field=F_P1946367021&facet.field=S_P1430116884&facet.field=S_P2017662620&facet.field=F_P1406451304&facet.field=F_P1406451306&facet.field=F_P1406451308&facet.field=S_P1500901421&facet.field=S_P1507138990&facet.field=I_P1406452433&facet.field=I_P1406453565&facet.field=I_P1406452463&facet.field=I_P1406453573&facet.field=I_P1406451324&facet.field=I_P1406451288&facet.field=S_P1406451282&facet.field=S_P1406452471&facet.field=S_P1424886605&facet.field=S_P1946367015&facet.field=S_P1424886598&facet.field=S_P1946367018&facet.field=S_P1406453556&facet.field=S_P1406389932&facet.fiel

Re: syntax for negative query OR something

2012-05-02 Thread Ryan McKinley
thanks!



On Wed, May 2, 2012 at 4:43 PM, Chris Hostetter
 wrote:
>
> : How do I search for things that have no value or a specified value?
>
> Things with no value...
>        (*:* -fieldName:[* TO *])
> Things with a specific value...
>        fieldName:A
> Things with no value or a specific value...
>        (*:* -fieldName:[* TO *]) fieldName:A
> ..."or" if you aren't using "OR" as your default op
>        (*:* -fieldName:[* TO *]) OR fieldName:A
>
> : I have a few variations of:
> : -fname:[* TO *] OR fname:(A B C)
>
> that is just syntactic sugar for...
>        -fname:[* TO *] fname:(A B C)
>
> which is an empty set.
>
> you need to be explicit that the "exclude docs with a value in this field"
> clause should be applied to the "set of all documents"
>
>
> -Hoss


Re: synonyms

2012-05-02 Thread Sohail Aboobaker
I think a regular sync of the database table with the synonym text file is the
simplest of the solutions. It will allow you to use Solr natively without
any customization, and it is not a very complicated operation to update the
synonyms file with entries from the database.
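A minimal sketch of such a sync job, assuming SolrJ 3.6's HttpSolrServer, a
hypothetical "synonyms" table with term and synonym_list columns, and
placeholder paths:

import java.io.FileWriter;
import java.io.PrintWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.CoreAdminRequest;

public class SynonymSync {
    public static void main(String[] args) throws Exception {
        // 1. read the synonym rows out of the database
        Connection conn = DriverManager.getConnection("jdbc:mysql://localhost/mydb", "user", "pass");
        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery("SELECT term, synonym_list FROM synonyms");

        // 2. rewrite synonyms.txt in the core's conf directory
        PrintWriter out = new PrintWriter(new FileWriter("/path/to/solr/core0/conf/synonyms.txt"));
        while (rs.next()) {
            // one mapping per line, e.g. "barcelona => Camp Nou,Cataluña"
            out.println(rs.getString("term") + " => " + rs.getString("synonym_list"));
        }
        out.close();
        conn.close();

        // 3. reload the core so the analyzers pick up the new file
        SolrServer admin = new HttpSolrServer("http://localhost:8983/solr");
        CoreAdminRequest.reloadCore("core0", admin);
    }
}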



Re: Solr 3.5 - Elevate.xml causing issues when placed under /data directory

2012-05-02 Thread Koji Sekiguchi

(12/05/03 1:39), Noordeen, Roxy wrote:

Hello,
I just started using elevation for solr. I am on solr 3.5, running with Drupal 
7, Linux.

1. I updated my solrconfig.xml
from
${solr.data.dir:./solr/data}

To
/usr/local/tomcat2/data/solr/dev_d7/data

2. I placed my elevate.xml in my solr's data directory. Based on forum answers, 
I thought placing elevate.xml under the data directory would pick up my latest changes.
I restarted tomcat.

3.  When I placed my elevate.xml under the conf directory, elevation was working 
with the url:

http://mysolr.www.com:8181/solr/elevate?q=games&wt=xml&sort=score+desc&fl=id,bundle_name

But when I moved it to the data directory, I am not seeing any results.

NOTE: I can see in catalina.out that solr is reading the file from the data 
directory. I tried giving invalid entries, and I noticed solr errors parsing 
elevate.xml from the data directory. I even tried sending some documents to index, 
thinking a commit might help it read the elevate config file. But nothing helped.

I don't understand why the below url does not work anymore.  There are no errors in 
the log files.

http://mysolr.www.com:8181/solr/elevate?q=games&wt=xml&sort=score+desc&fl=id,bundle_name


Any help on this topic is appreciated.


Hi Noordeen,

What do you mean by "I am not seeing any results"? Are there no docs in the response
(numFound="0") ?

And have you tried the original "${solr.data.dir:./solr/data}" for the dataDir?
Is that not working for you either?

koji
--
Query Log Visualizer for Apache Solr
http://soleami.com/


Re: synonyms

2012-05-02 Thread Jack Krupansky
There are lots of different strategies for dealing with synonyms, depending 
on what exactly is most important and what exactly you are willing to 
tolerate.


In your latest example, you seem to be using string fields, which is 
somewhat different from the text synonyms we talk about in Solr. You can 
certainly have multiple string fields, or even a multi-valued string field 
to store variations on selected categories of terms. That works well when 
you have a well-defined number of categories. So, you can have a user query 
go against a combination of normal text fields and these category string 
fields.


If that is sufficient for your application, great.

-- Jack Krupansky

-Original Message- 
From: Carlos Andres Garcia

Sent: Wednesday, May 02, 2012 6:57 PM
To: solr-user@lucene.apache.org
Subject: RE: synonyms

Thanks for your answers. Now I have another question: if I develop the
filter to replace the current synonym filter, I understand that this
process would happen at indexing time, because at query time there are
a lot of known problems. If so, how do I create my index
file?

For example:
I have two synonyms, "Camp Nou" and "Cataluña", for barcelona in the database


Option 1)
At indexing time I would create 2 records like this:


  barcelona
  Camp Nou
...


and


  barcelona
  Cataluña
...


Option 2)

Or I would create only one record like this:


  barcelona
  Camp Nou,Cataluña
...



If I create option 1, I can search by Camp Nou and by Cataluña, but when
I search by barcelona, Solr returns 2 records, and that is an error
because barcelona is only one.

If I create option 2, I have to search with wildcards, for example 
*Camp Nou* or *Cataluña*, and Solr would return one record; the same if 
I search by barcelona, Solr would return one record, which is good. But I 
want to know if that is the better option, or if Solr has another, better 
feature that can resolve this in a better way.



Re: syntax for negative query OR something

2012-05-02 Thread Jack Krupansky
Hmmm... I thought that worked in edismax. And I thought that pure negative 
queries were allowed in SolrQueryParser. Oh well.


In any case, in the Lucene or Solr query parser, add "*:*" to select all 
docs before negating the docs that have any value in the field:


(*:* -fname:*) OR fname:(A B C)

or

(*:* -fname:[* TO *]) OR fname:(A B C)

-- Jack Krupansky

-Original Message- 
From: Jack Krupansky

Sent: Wednesday, May 02, 2012 7:52 PM
To: solr-user@lucene.apache.org
Subject: Re: syntax for negative query OR something

Oops... that is:

(-fname:*) OR fname:(A B C)

or

(-fname:[* TO *]) OR fname:(A B C)

-- Jack Krupansky

-Original Message- 
From: Jack Krupansky

Sent: Wednesday, May 02, 2012 7:48 PM
To: solr-user@lucene.apache.org
Subject: Re: syntax for negative query OR something

Sounds good. "OR" in the negation of any query that matches any possible
value in a field.

The Solr query parser doc lists the open range as you used:

 -field:[* TO *] finds all documents without a value for field

See:
http://wiki.apache.org/solr/SolrQuerySyntax

This also includes a pure wildcard that can generate a PrefixQuery:

  -fname:* OR fname:(A B C)


-- Jack Krupansky

-Original Message- 
From: Ryan McKinley

Sent: Wednesday, May 02, 2012 7:18 PM
To: solr-user@lucene.apache.org
Subject: syntax for negative query OR something

How do I search for things that have no value or a specified value?

Essentially I have a field that *may* exist, and I want the absence of the
field to also match.

I have a few variations of:
-fname:[* TO *] OR fname:(A B C)


Thanks for any pointers

ryan 



Re: syntax for negative query OR something

2012-05-02 Thread Jack Krupansky

Oops... that is:

(-fname:*) OR fname:(A B C)

or

(-fname:[* TO *]) OR fname:(A B C)

-- Jack Krupansky

-Original Message- 
From: Jack Krupansky 
Sent: Wednesday, May 02, 2012 7:48 PM 
To: solr-user@lucene.apache.org 
Subject: Re: syntax for negative query OR something 

Sounds good. "OR" in the negation of any query that matches any possible 
value in a field.


The Solr query parser doc lists the open range as you used:

 -field:[* TO *] finds all documents without a value for field

See:
http://wiki.apache.org/solr/SolrQuerySyntax

This also includes a pure wildcard that can generate a PrefixQuery:

  -fname:* OR fname:(A B C)


-- Jack Krupansky

-Original Message- 
From: Ryan McKinley

Sent: Wednesday, May 02, 2012 7:18 PM
To: solr-user@lucene.apache.org
Subject: syntax for negative query OR something

How do I search for things that have no value or a specified value?

Essentially I have a field that *may* exist, and I want the absence of the
field to also match.

I have a few variations of:
-fname:[* TO *] OR fname:(A B C)


Thanks for any pointers

ryan 


Re: syntax for negative query OR something

2012-05-02 Thread Jack Krupansky
Sounds good. "OR" in the negation of any query that matches any possible 
value in a field.


The Solr query parser doc lists the open range as you used:

 -field:[* TO *] finds all documents without a value for field

See:
http://wiki.apache.org/solr/SolrQuerySyntax

This also includes a pure wildcard that can generate a PrefixQuery:

  -fname:* OR fname:(A B C)


-- Jack Krupansky

-Original Message- 
From: Ryan McKinley

Sent: Wednesday, May 02, 2012 7:18 PM
To: solr-user@lucene.apache.org
Subject: syntax for negative query OR something

How do I search for things that have no value or a specified value?

Essentially I have a field that *may* exist, and I want the absence of the
field to also match.

I have a few variations of:
-fname:[* TO *] OR fname:(A B C)


Thanks for any pointers

ryan 



Re: syntax for negative query OR something

2012-05-02 Thread Chris Hostetter

: How do I search for things that have no value or a specified value?

Things with no value...
(*:* -fieldName:[* TO *])
Things with a specific value...
fieldName:A
Things with no value or a specific value...
(*:* -fieldName:[* TO *]) fieldName:A
..."or" if you aren't using "OR" as your default op
(*:* -fieldName:[* TO *]) OR fieldName:A

: I have a few variations of:
: -fname:[* TO *] OR fname:(A B C)

that is just syntactic sugar for...
-fname:[* TO *] fname:(A B C)

which is an empty set.

you need to be explicit that the "exclude docs with a value in this field" 
clause should be applied to the "set of all documents"


-Hoss


solr broke a pipe

2012-05-02 Thread Robert Petersen
Anyone have any clues about this exception?  It happened during the
course of normal indexing.  This is new to me (we're running solr 3.6 on
tomcat 6/redhat RHEL) and we've been running smoothly for some time now
until this showed up:

>>>Red Hat Enterprise Linux Server release 5.3 (Tikanga)

>>> 

>>>Apache Tomcat Version 6.0.20

>>> 

>>>java.runtime.version = 1.6.0_25-b06

>>> 

>>>java.vm.name = Java HotSpot(TM) 64-Bit Server VM

 

May 2, 2012 4:07:48 PM
org.apache.solr.handler.ReplicationHandler$FileStream write

WARNING: Exception while writing response for params:
indexversion=1276893500358&file=_1uca.frq&command=filecontent&checksum=true&wt=filestream

ClientAbortException:  java.net.SocketException: Broken pipe

        at org.apache.catalina.connector.OutputBuffer.realWriteBytes(OutputBuffer.java:358)
        at org.apache.tomcat.util.buf.ByteChunk.append(ByteChunk.java:354)
        at org.apache.catalina.connector.OutputBuffer.writeBytes(OutputBuffer.java:381)
        at org.apache.catalina.connector.OutputBuffer.write(OutputBuffer.java:370)
        at org.apache.catalina.connector.CoyoteOutputStream.write(CoyoteOutputStream.java:89)
        at org.apache.solr.common.util.FastOutputStream.write(FastOutputStream.java:87)
        at org.apache.solr.handler.ReplicationHandler$FileStream.write(ReplicationHandler.java:1076)
        at org.apache.solr.handler.ReplicationHandler$3.write(ReplicationHandler.java:936)
        at org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:345)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:273)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
        at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:849)
        at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
        at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:454)
        at java.lang.Thread.run(Unknown Source)

Caused by: java.net.SocketException: Broken pipe

        at java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.net.SocketOutputStream.socketWrite(Unknown Source)
        at java.net.SocketOutputStream.write(Unknown Source)
        at org.apache.coyote.http11.InternalOutputBuffer.realWriteBytes(InternalOutputBuffer.java:740)
        at org.apache.tomcat.util.buf.ByteChunk.flushBuffer(ByteChunk.java:434)
        at org.apache.tomcat.util.buf.ByteChunk.append(ByteChunk.java:349)
        at org.apache.coyote.http11.InternalOutputBuffer$OutputStreamOutputBuffer.doWrite(InternalOutputBuffer.java:764)
        at org.apache.coyote.http11.filters.ChunkedOutputFilter.doWrite(ChunkedOutputFilter.java:126)
        at org.apache.coyote.http11.InternalOutputBuffer.doWrite(InternalOutputBuffer.java:573)
        at org.apache.coyote.Response.doWrite(Response.java:560)
        at org.apache.catalina.connector.OutputBuffer.realWriteBytes(OutputBuffer.java:353)

        ... 21 more



Re: Solr Merge during off peak times

2012-05-02 Thread Otis Gospodnetic
Hello Prabhu,

Look at SPM for Solr (URL in sig below).  It includes Index Statistics graphs, 
and from these graphs you can tell:

* how many docs are in your index
* how many docs are deleted
* size of index on disk
* number of index segments
* number of index files
* maybe something else I'm forgetting now

So from the size, number of segments, and number of index files you will be able to tell when 
merges happened, and the before/after size, segment count, and index file count.

Otis 

Performance Monitoring for Solr / ElasticSearch / HBase - 
http://sematext.com/spm 
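P.S. On the custom time-of-day merge policy idea in the quoted thread below:
a minimal, untested sketch against the Lucene 3.x API could look like the
following (the off-peak window is an assumption, and returning null simply
defers normal merges; forced merges would need the same treatment):

import java.io.IOException;
import java.util.Calendar;

import org.apache.lucene.index.LogByteSizeMergePolicy;
import org.apache.lucene.index.SegmentInfos;

public class OffPeakMergePolicy extends LogByteSizeMergePolicy {
    private boolean offPeak() {
        int hour = Calendar.getInstance().get(Calendar.HOUR_OF_DAY);
        return hour < 6 || hour >= 22;  // assumed off-peak window: 10pm - 6am
    }

    @Override
    public MergeSpecification findMerges(SegmentInfos infos) throws IOException {
        // during peak hours, report no merges; Lucene will ask again later
        return offPeak() ? super.findMerges(infos) : null;
    }
}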




>
> From: "Prakashganesh, Prabhu" 
>To: "solr-user@lucene.apache.org" ; Otis 
>Gospodnetic  
>Sent: Wednesday, May 2, 2012 7:22 AM
>Subject: RE: Solr Merge during off peak times
> 
>Ok, thanks Otis
>Another question on merging
>What is the best way to monitor merging?
>Is there something in the log file that I can look for? 
>It seems like I have to monitor the system resources - read/write IOPS etc.. 
>and work out when a merge happened
>It would be great if I can do it by looking at log files or in the admin UI. 
>Do you know if this can be done or if there is some tool for this?
>
>Thanks
>Prabhu
>
>-Original Message-
>From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] 
>Sent: 01 May 2012 15:12
>To: solr-user@lucene.apache.org
>Subject: Re: Solr Merge during off peak times
>
>Hi Prabhu,
>
>I don't think such a merge policy exists, but it would be nice to have this 
>option and I imagine it wouldn't be hard to write if you really just base the 
>merge or no merge decision on the time of day (and maybe day of the week).
>
>Note that this should go into Lucene, not Solr, so if you decide to contribute 
>your work, please see http://wiki.apache.org/lucene-java/HowToContribute
>
>Otis
>
>Performance Monitoring for Solr - http://sematext.com/spm
>
>
>
>
>>
>> From: "Prakashganesh, Prabhu" 
>>To: "solr-user@lucene.apache.org"  
>>Sent: Tuesday, May 1, 2012 8:45 AM
>>Subject: Solr Merge during off peak times
>> 
>>Hi,
>>  I would like to know if there is a way to configure index merge policy in 
>>solr so that the merging happens during off peak hours. Can you please let me 
>>know if such a merge policy configuration exists?
>>
>>Thanks
>>Prabhu
>>
>>
>>
>
>
>

RE: synonyms

2012-05-02 Thread Carlos Andres Garcia
Thanks for your answers. Now I have another question: if I develop the
filter to replace the current synonym filter, I understand that this
process would happen at indexing time, because at query time there are
a lot of known problems. If so, how do I create my index
file?

For example:
I have two synonyms, "Camp Nou" and "Cataluña", for barcelona in the database


Option 1)
At indexing time I would create 2 records like this:


   barcelona
   Camp Nou
...


and


   barcelona
   Cataluña
...


Option 2)

Or I would create only one record like this:


   barcelona
   Camp Nou,Cataluña
...



If I create option 1, I can search by Camp Nou and by Cataluña, but when
I search by barcelona, Solr returns 2 records, and that is an error
because barcelona is only one.

If I create option 2, I have to search with wildcards, for example *Camp 
Nou* or *Cataluña*, and Solr would return one record; the same if 
I search by barcelona, Solr would return one record, which is good. But I want 
to know if that is the better option, or if Solr has another, better feature 
that can resolve this in a better way.



Re: Phrase Slop probelm

2012-05-02 Thread Jack Krupansky
You are missing the "pf", "pf2", and "pf3" request parameters, which say 
which fields to do phrase proximity boosting on.


"pf" boosts using the whole query as a phrase, "pf2" boosts bigrams, and 
"pf3" boosts trigrams.


You can use any combination of them, but if you use none of them, "ps" 
appears to be ignored.
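
For example, adding something like "&pf=textoboost&ps=10" to the requests
above (assuming textoboost, the field already in your qf, is the one you want
phrase-boosted) should make the "ps" value actually take effect.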


Maybe it should default to doing some boost if none of the field lists is 
given, like boost using bigrams in the "qf" fields, but it doesn't.


-- Jack Krupansky

-Original Message- 
From: André Maldonado

Sent: Wednesday, May 02, 2012 3:29 PM
To: solr-user@lucene.apache.org
Subject: Phrase Slop probelm

Hi all.

In my index I have a multivalued field that contains a lot of information;
all text searches are based on it. So, when I do:

http://xxx.xx.xxx.xxx:/Index/select/?start=0&rows=12&q=term1+term2+term3&qf=textoboost&fq=field1%3aanother_term&defType=edismax&mm=100%25

I got the same result as in:

http://xxx.xx.xxx.xxx:/Index/select/?start=0&rows=12&q=term1+term2+term3&ps=0&qf=textoboost&fq=field1%3aanother_term&defType=edismax&mm=100%25

And the same result in:

http://xxx.xx.xxx.xxx:/Index/select/?start=0&rows=12&q=term1+term2+term3&ps=10&qf=textoboost&fq=field1%3aanother_term&defType=edismax&mm=100%25

What am I doing wrong?

Thanks

--
"And you shall know the truth, and the truth shall set you free." (João 8:32)

andre.maldonado@gmail.com
(11) 9112-4227



Re: need some help with a multicore config of solr3.6.0+tomcat7. mine reports: "Severe errors in solr configuration."

2012-05-02 Thread vybe3142
I chronicled exactly what I had to configure to slay this dragon at
http://vinaybalamuru.wordpress.com/2012/04/12/solr4-tomcat-multicor/

Hope that helps

--
View this message in context: 
http://lucene.472066.n3.nabble.com/need-some-help-with-a-multicore-config-of-solr3-6-0-tomcat7-mine-reports-Severe-errors-in-solr-confi-tp3957196p3957389.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: need some help with a multicore config of solr3.6.0+tomcat7. mine reports: "Severe errors in solr configuration."

2012-05-02 Thread Robert Petersen
I don't know if this will help, but I usually add a dataDir element to
each core's solrconfig.xml to point at a local data folder for the core,
like this:

<dataDir>${solr.data.dir:./solr/core0/data}</dataDir>

-Original Message-
From: loc...@mm.st [mailto:loc...@mm.st] 
Sent: Wednesday, May 02, 2012 1:06 PM
To: solr-user@lucene.apache.org
Subject: need some help with a multicore config of solr3.6.0+tomcat7.
mine reports: "Severe errors in solr configuration."


i've installed tomcat7 and solr 3.6.0 on linux/64

i'm trying to get a single webapp + multicore setup working.  my efforts
have gone off the rails :-/ i suspect i've followed too many of the
wrong examples.

i'd appreciate some help/direction getting this working.

so far, i've configured

grep   /etc/tomcat7/server.xml -A2 -B2
 Java AJP  Connector: /docs/config/ajp.html
 APR (HTTP/AJP) Connector: /docs/apr.html
 Define a non-SSL HTTP/1.1 Connector on port
 
-->

--

RE: synonyms

2012-05-02 Thread Noordeen, Roxy
Another solution is to write a script to read the database and create the 
synonyms.txt file, dump the file to solr and reload the core.
This gives you the custom synonym solution.



-Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com] 
Sent: Wednesday, May 02, 2012 4:54 PM
To: solr-user@lucene.apache.org
Subject: Re: synonyms

I'm not sure I completely follow, but are you simply saying that you want to 
have a synonym filter that reads the synonym table from a database rather 
than the current text file? If so, sure, you could develop a replacement for 
the current synonym filter which loads its table from a database, but you 
would have to develop that code yourself (or get some assistance doing it.)

If that is not what you are trying to do, please explain in a little more 
detail.

-- Jack Krupansky

-Original Message- 
From: Carlos Andres Garcia
Sent: Wednesday, May 02, 2012 4:31 PM
To: solr-user@lucene.apache.org
Subject: synonyms

Hello everybody,

I have a question about synonyms in Solr. In our company we are looking 
for a solution that resolves synonyms from a database rather than from a 
text file, as SynonymFilterFactory does.

The idea is to save all the synonyms in the database and index them so they will 
be ready at query time, but we haven't found a solution that works from a database.

Another idea is to create a plugin that extends SynonymFilterFactory, but I 
don't know if this is possible.

I hope someone can help me.

regards,

Carlos Andrés García García 



Re: synonyms

2012-05-02 Thread Jack Krupansky
I'm not sure I completely follow, but are you simply saying that you want to 
have a synonym filter that reads the synonym table from a database rather 
than the current text file? If so, sure, you could develop a replacement for 
the current synonym filter which loads its table from a database, but you 
would have to develop that code yourself (or get some assistance doing it.)


If that is not what you are trying to do, please explain in a little more 
detail.


-- Jack Krupansky

-Original Message- 
From: Carlos Andres Garcia

Sent: Wednesday, May 02, 2012 4:31 PM
To: solr-user@lucene.apache.org
Subject: synonyms

Hello everybody,

I have a question about synonyms in Solr. In our company we are looking 
for a solution that resolves synonyms from a database rather than from a 
text file, as SynonymFilterFactory does.

The idea is to save all the synonyms in the database and index them so they will 
be ready at query time, but we haven't found a solution that works from a database.

Another idea is to create a plugin that extends SynonymFilterFactory, but I 
don't know if this is possible.


I hope someone can help me.

regards,

Carlos Andrés García García 



synonyms

2012-05-02 Thread Carlos Andres Garcia
Hello everybody,

I have a question about synonyms in Solr. In our company we are looking 
for a solution that resolves synonyms from a database rather than from a 
text file, as SynonymFilterFactory does.

The idea is to save all the synonyms in the database and index them so they will 
be ready at query time, but we haven't found a solution that works from a database.

Another idea is to create a plugin that extends SynonymFilterFactory, but I 
don't know if this is possible.

I hope someone can help me.

regards,

Carlos Andrés García García


Re: Solr Merge during off peak times

2012-05-02 Thread Jason Rutherglen
> BTW, in 4.0, there's DocumentWriterPerThread that
> merges in the background

It flushes without pausing, but does not perform merges.  Maybe you're
thinking of ConcurrentMergeScheduler?

On Wed, May 2, 2012 at 7:26 AM, Erick Erickson  wrote:
> Optimizing is much less important query-speed wise
> than historically, essentially it's not recommended much
> any more.
>
> A significant effect of optimize _used_ to be purging
> obsolete data (i.e. that from deleted docs) from the
> index, but that is now done on merge.
>
> There's no harm in optimizing on off-peak hours, and
> combined with an appropriate merge policy that may make
> indexing a little better (I'm thinking of not doing
> as many massive merges here).
>
> BTW, in 4.0, there's DocumentWriterPerThread that
> merges in the background and pretty much removes
> even this as a motivation for optimizing.
>
> All that said, optimizing isn't _bad_, it's just often
> unnecessary.
>
> Best
> Erick
>
> On Wed, May 2, 2012 at 9:29 AM, Prakashganesh, Prabhu
>  wrote:
>> Actually we are not thinking of a M/S setup
>> We are planning to have x number of shards on N number of servers, each of 
>> the shard handling both indexing and searching
>> The expected query volume is not that high, so don't think we would need to 
>> replicate to slaves. We think each shard will be able to handle its share of 
>> the indexing and searching. If we need to scale query capacity in future, 
>> yeah probably need to do it by replicating each shard to its slaves
>>
>> I agree autoCommit settings would be good to set up appropriately
>>
>> Another question I had is pros/cons of optimising the index. We would be 
>> purging old content every week and am thinking whether to run an index 
>> optimise in the weekend after purging old data. Because we are going to be 
>> continuously indexing data which would be mix of adds, updates, deletes, not 
>> sure if the benefit of optimising would last long enough to be worth doing 
>> it. Maybe setting a low mergeFactor would be good enough. Optimising makes 
>> sense if the index is more static, perhaps? Thoughts?
>>
>> Thanks
>> Prabhu
>>
>>
>> -Original Message-
>> From: Erick Erickson [mailto:erickerick...@gmail.com]
>> Sent: 02 May 2012 13:15
>> To: solr-user@lucene.apache.org
>> Subject: Re: Solr Merge during off peak times
>>
>> But again, with a master/slave setup merging should
>> be relatively benign. And at 200M docs, having a M/S
>> setup is probably indicated.
>>
>> Here's a good writeup of mergepolicy
>> http://juanggrande.wordpress.com/2011/02/07/merge-policy-internals/
>>
>> If you're indexing and searching on a single machine, merging
>> is much less important than how often you commit. If a M/S
>> situation, then you're polling interval on the slave is important.
>>
>> I'd look at commit frequency long before I worried about merging,
>> that's usually where people shoot themselves in the foot - by
>> committing too often.
>>
>> Overall, your mergeFactor is probably less important than other
>> parts of how you perform indexing/searching, but it does have
>> some effect for sure...
>>
>> Best
>> Erick
>>
>> On Wed, May 2, 2012 at 7:54 AM, Prakashganesh, Prabhu
>>  wrote:
>>> We have a fairly large scale system - about 200 million docs and fairly 
>>> high indexing activity - about 300k docs per day with peak ingestion rates 
>>> of about 20 docs per sec. I want to work out what a good mergeFactor 
>>> setting would be by testing with different mergeFactor settings. I think 
>>> the default of 10 might be high, I want to try with 5 and compare. Unless I 
>>> know when a merge starts and finishes, it would be quite difficult to work 
>>> out the impact of changing mergeFactor. I want to be able to measure how 
>>> long merges take, run queries during the merge activity and see what the 
>>> response times are etc..
>>>
>>> Thanks
>>> Prabhu
>>>
>>> -Original Message-
>>> From: Erick Erickson [mailto:erickerick...@gmail.com]
>>> Sent: 02 May 2012 12:40
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: Solr Merge during off peak times
>>>
>>> Why do you care? Merging is generally a background process, or are
>>> you doing heavy indexing? In a master/slave setup,
>>> it's usually not really relevant except that (with 3.x), massive merges
>>> may temporarily stop indexing. Is that the problem?
>>>
>>> Look at the merge policys, there are configurations that make
>>> this less painful.
>>>
>>> In trunk, DocumentWriterPerThread makes merges happen in the
>>> background, which helps the long-pause-while-indexing problem.
>>>
>>> Best
>>> Erick
>>>
>>> On Wed, May 2, 2012 at 7:22 AM, Prakashganesh, Prabhu
>>>  wrote:
 Ok, thanks Otis
 Another question on merging
 What is the best way to monitor merging?
 Is there something in the log file that I can look for?
 It seems like I have to monitor the system resources - read/write IOPS 
 etc.. and work out when a merge happened
 It would be g

need some help with a multicore config of solr3.6.0+tomcat7. mine reports: "Severe errors in solr configuration."

2012-05-02 Thread locuse

i've installed tomcat7 and solr 3.6.0 on linux/64

i'm trying to get a single webapp + multicore setup working.  my efforts
have gone off the rails :-/ i suspect i've followed too many of the
wrong examples.

i'd appreciate some help/direction getting this working.

so far, i've configured

grep   /etc/tomcat7/server.xml -A2 -B2
 Java AJP  Connector: /docs/config/ajp.html
 APR (HTTP/AJP) Connector: /docs/apr.html
 Define a non-SSL HTTP/1.1 Connector on port
 
-->

--

Re: Error with distributed search and Suggester component (Solr 3.4)

2012-05-02 Thread Robert Muir
On Wed, May 2, 2012 at 12:16 PM, Ken Krugler
 wrote:

> What confuses me is that Suggester says it's based on SpellChecker, which 
> supposedly does work with shards.
>

It is based on spellchecker apis, but spellchecker's ranking is based
on simple comparators like string similarity, whereas suggesters use
weights.

when spellchecker merges from shards, it just merges all their top-N
into one set and recomputes this same distance stuff over again.

so, suggester can't possibly work like this correctly (forget about
any technical details), as how can it make assumptions about these
weights you provided. if they were e.g. log() weights from your query
logs then it needs to do log-summation across the shards, etc for the
final combined weight to be correct. This is specific to how you
originally computed the weights you gave it. it certainly cannot be
recomputing anything like spellchecker does :)
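
A tiny illustration, assuming (purely for this example) that each shard's
suggestion weight is log(frequency):

public class LogWeightMerge {
    public static void main(String[] args) {
        double w1 = Math.log(120);  // shard 1: term seen 120 times
        double w2 = Math.log(80);   // shard 2: term seen 80 times
        // naively adding log weights multiplies the frequencies:
        double wrong = w1 + w2;                               // == log(9600)
        // a correct merge sums the underlying frequencies:
        double right = Math.log(Math.exp(w1) + Math.exp(w2)); // == log(200)
        System.out.println(wrong + " vs " + right);
    }
}

A generic merge that just re-ranks the top-N by the raw weights can't know
which of these (if either) is appropriate.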

Anyways, if you really want to do it, maybe
https://issues.apache.org/jira/browse/SOLR-2848 is helpful. The
background is that in 3.x there is really only one spellchecker impl
(AbstractLucene or something like that). I don't think distributed
spellcheck works with any other SpellChecker subclasses in 3.x; I
think it's "wired" to only work with the Abstract-Lucene ones.

When we added another subclass to 4.0, DirectSpellChecker, he saw that
it was broken here and cleaned up the APIs so that spellcheckers can
override this merge() operation. Unfortunately I forgot to commit
those refactorings James did (which lets any spellchecker override
merge()ing) to the 3.x branch, but the ideas might be useful.

-- 
lucidimagination.com


Re: question about dates

2012-05-02 Thread Chris Hostetter

: String dateString = "20101230";
: SimpleDateFormat sdf = new SimpleDateFormat("yyyyMMdd");
: Date date = sdf.parse(dateString);
: doc.addField("date", date);
: 
: In the index, the date "20101230" is saved as "2010-12-29T23:00:00Z" (because
: of GMT).

"because of GMT" is missleading and vague ... what you get in your index 
is a value of "2010-12-29T23:00:00Z" because that is the canonical 
string representation of the date object you have passed to doc.addField 
-- the date object you have passed in represents that time, because you 
constructed a SimpleDateFormat object w/o specifying which TimeZone that 
SDF object should assume is in use when it parses it's string input.  So 
when you give it the input "20101230" it treats that is Dec 30, 2010, 
00:00:00.000 in whatever teh local timezone of your client is.

If you want it to treat that input string as a date expression in GMT, 
then you need to configure the parser to use GMT 
(SimpleDateFormat.setTimeZone)
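
A minimal sketch of that fix (format strings per the code quoted below, with
both the parser and the output formatter pinned to GMT):

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class GmtDateExample {
    public static void main(String[] args) throws Exception {
        SimpleDateFormat sdf = new SimpleDateFormat("yyyyMMdd");
        sdf.setTimeZone(TimeZone.getTimeZone("GMT")); // parse "20101230" as a GMT date
        Date date = sdf.parse("20101230");

        SimpleDateFormat solrFmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
        solrFmt.setTimeZone(TimeZone.getTimeZone("GMT")); // format in GMT for Solr
        System.out.println(solrFmt.format(date)); // prints 2010-12-30T00:00:00Z
    }
}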

: I tried the following code :
: 
: String dateString = "20101230";
: SimpleDateFormat sdf = new SimpleDateFormat("yyyyMMdd");
: Date date = sdf.parse(dateString);
: SimpleDateFormat gmtSdf = new
: SimpleDateFormat("yyyy-MM-dd'T'HH\\:mm\\:ss'Z'");
: String gmtString = gmtSdf.format(date);
: 
: The problem is that gmtString is equals to "2010-12-30T00\:00\:00Z". There is

again, that is not a "gmtString" .. in this case, both of the SDF objects 
you are using have not been configured with an explicit TimeZone, so they 
use whatever the platform default is where this code is run -- so the 
variable you are calling "gmtString" is actually a string representation 
of a Date object formatted in your local TimeZone.

Bottom line...

* when parsing a string into a Date, you really need to know (and be 
explicit to the parser about) what timezone is represented in that string 
(unless the format of the string includes the TimeZone)

* when building a query string to pass to solr, the DateFormat 
you use to format a Date object must format it using GMT -- there is a 
DateUtil class included in solrj to make this easier.

If you really don't care at all about TimeZones, then just use GMT 
everywhere .. but if you actually care about what time of day something 
happened, and want to be able to query for events with hour/min/sec/etc.. 
granularity, then you need to be precise about the TimeZone in every 
Formatter you use.


-Hoss


Dynamic core creation works in 3.5.0 fails in 3.6.0: At least one core definition required at run-time for Solr 3.6.0?

2012-05-02 Thread Emes, Matthew (US - Irvine)
Hi:

I have been working on an integration project involving Solr 3.5.0 that
dynamically registers cores as needed at run-time, but does not contain any
cores by default. The current solr.xml configuration file is:-

<solr persistent="true">
  <cores adminPath="/admin/cores" />
</solr>


This configuration does not include any cores as those are created
dynamically by each application that is using the Solr server. This is
working fine with Solr 3.5.0; the server starts and running web
applications can register a new core using SolrJ CoreAdminRequest and
everything is working correctly. However, I tried to update to Solr 3.6.0
and this configuration fails with a SolrException due to the following code
in CoreContainer.java (lines 171-173):-

if (cores.cores.isEmpty()){
throw new SolrException(SolrException.ErrorCode.SERVER_ERROR, "No cores
were created, please check the logs for errors");
}
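
(For context, the run-time registration in question is essentially just this
SolrJ call -- a minimal sketch, with the admin URL and core name as
placeholders:)

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.CoreAdminRequest;

public class RegisterCore {
    public static void main(String[] args) throws Exception {
        SolrServer admin = new HttpSolrServer("http://localhost:8983/solr");
        // register a new core at run-time; instanceDir must already contain conf/
        CoreAdminRequest.createCore("appcore1", "appcore1", admin);
    }
}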

This is a change from Solr 3.5.0 which has no such check. I have searched
but cannot find any ticket or notice that this is a planned change in
3.6.0, but before I file a ticket I am asking the community in case this is
an issue that has been discussed and this is a planned direction for Solr.

Thanks,

Matthew


Re: ExtractRH: How to strip metadata

2012-05-02 Thread Joseph Hagerty
How interesting! You know, I did at one point consider that perhaps the
fieldname "meta" may be treated specially, but I talked myself out of it. I
reasoned that a field name in my local schema should have no bearing on how
a plugin such as solr-cell/Tika behaves. I should have tested my
hypothesis; even if this phenomenon turns out to be undocumented behavior,
I consider myself a victim of my own assumptions.

I am running version 3.5. You may have gotten the multivalue errors due to
the way your test schema and/or extracting request handler is laid out (my
bad). I am using the "ignored" fieldtype and a dynamicField called
"ignored_*" as a catch-all for extraneous fields delivered by Tika.

Thanks for your help! Please keep me posted on any further
insights/revelations, and I'll do the same.
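
For the record, my workaround is simply to map the extracted content to a
field that isn't literally named "meta". A rough SolrJ sketch of such a
request (field, id, and file names are placeholders; addFile(File) is the 3.x
signature, and HttpSolrServer is SolrJ 3.6's name for CommonsHttpSolrServer):

import java.io.File;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class ExtractBody {
    public static void main(String[] args) throws Exception {
        SolrServer server = new HttpSolrServer("http://localhost:8983/solr");
        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
        req.addFile(new File("test.doc"));
        req.setParam("literal.id", "doc1");     // unique key for the document
        req.setParam("fmap.content", "body");   // send Tika's content somewhere other than "meta"
        req.setParam("uprefix", "ignored_");    // shunt unknown metadata fields aside
        req.setParam("commit", "true");
        server.request(req);
    }
}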

On Wed, May 2, 2012 at 12:54 PM, Jack Krupansky wrote:

> I did some testing, and evidently the "meta" field is treated specially
> by the ERH.
>
> I copied the example schema, and added both "meta" and "metax" fields and
> set "fmap.content=metax", and lo and behold only the doc content appears in
> "metax", but all the doc metadata appears in "meta".
>
> Although, I did get 400 errors with Solr complaining that "meta" was not a
> multivalued field. This is with Solr 3.6. What release of Solr are you
> using?
>
> I was not aware of this undocumented feature. I haven't checked the code
> yet.
>
>
> -- Jack Krupansky
>
> -Original Message- From: Joseph Hagerty
> Sent: Wednesday, May 02, 2012 11:10 AM
> To: solr-user@lucene.apache.org
> Subject: Re: ExtractRH: How to strip metadata
>
>
> I do not. I commented out all of the copyFields provided in the default
> schema.xml that ships with 3.5. My schema is rather minimal. Here is my
> fields block, if this helps:
>
> 
>   required="true"  />
>   required="true"  />
>   required="true"  />
>   required="true"  />
>  
>  
> 
>
>
On Wed, May 2, 2012 at 10:59 AM, Jack Krupansky wrote:
>
>  Check to see if you have a CopyField for a wildcard pattern that copies to
>> "meta", which would copy all of the Tika-generated fields to "meta."
>>
>> -- Jack Krupansky
>>
>> -Original Message- From: Joseph Hagerty
>> Sent: Wednesday, May 02, 2012 9:56 AM
>> To: solr-user@lucene.apache.org
>> Subject: ExtractRH: How to strip metadata
>>
>>
>> Greetings Solr folk,
>>
>> How can I instruct the extract request handler to ignore metadata/headers
>> etc. when it constructs the "content" of the document I send to it?
>>
>> For example, I created an MS Word document containing just the word
>> "SEARCHWORD" and nothing else. However, when I ship this doc to my solr
>> server, here's what's thrown in the index:
>>
>> 
>> Last-Printed 2009-02-05T15:02:00Z Revision-Number 22 Comments
>> stream_source_info myfile Last-Author Inigo Montoya Template Normal.dotm
>> Page-Count 1 subject Application-Name Microsoft Macintosh Word Author Jesus
>> Baggins Word-Count 2 xmpTPg:NPages 1 Edit-Time 1086 Creation-Date
>> 2008-11-05T20:19:00Z stream_content_type application/octet-stream Character
>> Count 14 stream_size 31232 stream_name /Applications/MAMP/tmp/php/phpHCIg7y
>> Company Parkman Elastomers Pvt Ltd Content-Type application/msword Keywords
>> Last-Save-Date 2012-05-01T18:55:00Z SEARCHWORD
>> 
>>
>> All I want is the body of the document, in this case the word
>> "SEARCHWORD."
>>
>> For further reference, here's my extraction handler:
>>
>> <requestHandler name="/update/extract" startup="lazy"
>>     class="solr.extraction.ExtractingRequestHandler" >
>>   <lst name="defaults">
>>     <str name="fmap.content">meta</str>
>>     <str name="lowernames">true</str>
>>     <str name="uprefix">ignored_</str>
>>   </lst>
>> </requestHandler>
>>
>> (Ironically, "meta" is the field in the solr schema to which I'm
>> attempting
>> to extract the body of the document. Don't ask).
>>
>> Thanks in advance for any pointers you can provide me.
>>
>> --
>> - Joe
>>
>>
>
>
> --
> - Joe
>



-- 
- Joe


Re: Removing old documents

2012-05-02 Thread alxsss

 

 I use jetty that comes with solr. 
I use solr's dedupe


<updateRequestProcessorChain name="dedupe">
   <processor class="solr.processor.SignatureUpdateProcessorFactory">
     <bool name="enabled">true</bool>
     <str name="signatureField">id</str>
     <bool name="overwriteDupes">true</bool>
     <str name="fields">url</str>
     <str name="signatureClass">solr.processor.Lookup3Signature</str>
   </processor>
   <processor class="solr.LogUpdateProcessorFactory" />
   <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>


and because of this, id is not the url itself but its encoded signature.

I see solrclean uses url to delete a document.

Is it possible that the issue is because of this mismatch?
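
If so, a possible workaround sketch would be to delete by a query on the url
field instead of by id (host and URL below are placeholders, using SolrJ 3.6's
HttpSolrServer):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class DeleteByUrl {
    public static void main(String[] args) throws Exception {
        SolrServer server = new HttpSolrServer("http://localhost:8983/solr");
        // query the stored url field, since id holds the signature, not the url
        server.deleteByQuery("url:\"http://example.com/old-page\"");
        server.commit();
    }
}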


Thanks.
Alex.


 

-Original Message-
From: Paul Libbrecht 
To: solr-user 
Sent: Tue, May 1, 2012 11:43 pm
Subject: Re: Removing old documents


With which client?

paul


Le 2 mai 2012 à 01:29, alx...@aim.com a écrit :

> all caching is disabled and I restarted jetty. The same results.


 


Re: Dumb question: Streaming collector /query results

2012-05-02 Thread Mikhail Khludnev
I did some small research with a fairly modest result:
https://github.com/m-khl/solr-patches/tree/streaming

you can start exploring it from the trivial test
https://github.com/m-khl/solr-patches/blob/17cd45ce7693284de08d39ebc8812aa6a20b8fb3/solr/core/src/test/org/apache/solr/response/ResponseStreamingTest.java

pls let me know whether it's useful for you.

On Wed, May 2, 2012 at 6:48 PM, vybe3142  wrote:

> her words, .. as an alternative , what's the most efficient way to gain
> access to all of the document ids that match a qu
>



-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics


 


Re: ExtractRH: How to strip metadata

2012-05-02 Thread Jack Krupansky
I did some testing, and evidently the "meta" field is treated specially by 
the ERH.


I copied the example schema, and added both "meta" and "metax" fields and 
set "fmap.content=metax", and lo and behold only the doc content appears in 
"metax", but all the doc metadata appears in "meta".


Although, I did get 400 errors with Solr complaining that "meta" was not a 
multivalued field. This is with Solr 3.6. What release of Solr are you 
using?


I was not aware of this undocumented feature. I haven't checked the code 
yet.


-- Jack Krupansky

-Original Message- 
From: Joseph Hagerty

Sent: Wednesday, May 02, 2012 11:10 AM
To: solr-user@lucene.apache.org
Subject: Re: ExtractRH: How to strip metadata

I do not. I commented out all of the copyFields provided in the default
schema.xml that ships with 3.5. My schema is rather minimal. Here is my
fields block, if this helps:


  
  
  
  
  
  



On Wed, May 2, 2012 at 10:59 AM, Jack Krupansky 
wrote:



Check to see if you have a CopyField for a wildcard pattern that copies to
"meta", which would copy all of the Tika-generated fields to "meta."

-- Jack Krupansky

-Original Message- From: Joseph Hagerty
Sent: Wednesday, May 02, 2012 9:56 AM
To: solr-user@lucene.apache.org
Subject: ExtractRH: How to strip metadata


Greetings Solr folk,

How can I instruct the extract request handler to ignore metadata/headers
etc. when it constructs the "content" of the document I send to it?

For example, I created an MS Word document containing just the word
"SEARCHWORD" and nothing else. However, when I ship this doc to my solr
server, here's what's thrown in the index:


Last-Printed 2009-02-05T15:02:00Z Revision-Number 22 Comments
stream_source_info myfile Last-Author Inigo Montoya Template Normal.dotm
Page-Count 1 subject Application-Name Microsoft Macintosh Word Author Jesus
Baggins Word-Count 2 xmpTPg:NPages 1 Edit-Time 1086 Creation-Date
2008-11-05T20:19:00Z stream_content_type application/octet-stream Character
Count 14 stream_size 31232 stream_name /Applications/MAMP/tmp/php/phpHCIg7y
Company Parkman Elastomers Pvt Ltd Content-Type application/msword Keywords
Last-Save-Date 2012-05-01T18:55:00Z SEARCHWORD


All I want is the body of the document, in this case the word 
"SEARCHWORD."


For further reference, here's my extraction handler:

<requestHandler name="/update/extract" startup="lazy"
    class="solr.extraction.ExtractingRequestHandler" >
  <lst name="defaults">
    <str name="fmap.content">meta</str>
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>
  </lst>
</requestHandler>
 

(Ironically, "meta" is the field in the solr schema to which I'm attempting
to extract the body of the document. Don't ask).

Thanks in advance for any pointers you can provide me.

--
- Joe





--
- Joe 



Solr 3.5 - Elevate.xml causing issues when placed under /data directory

2012-05-02 Thread Noordeen, Roxy
Hello,
I just started using elevation for solr. I am on solr 3.5, running with Drupal 
7, Linux.

1. I updated my solrconfig.xml
from
${solr.data.dir:./solr/data}

To
/usr/local/tomcat2/data/solr/dev_d7/data

2. I placed my elevate.xml in my solr's data directory. Based on forum answers, 
I thought placing elevate.xml under the data directory would pick up my latest changes.
I restarted tomcat.

3.  When I placed my elevate.xml under the conf directory, elevation was working 
with the url:

http://mysolr.www.com:8181/solr/elevate?q=games&wt=xml&sort=score+desc&fl=id,bundle_name

But when I moved it to the data directory, I am not seeing any results.

NOTE: I can see in catalina.out that solr is reading the file from the data 
directory. I tried giving invalid entries, and I noticed solr errors parsing 
elevate.xml from the data directory. I even tried sending some documents to index, 
thinking a commit might help it read the elevate config file. But nothing helped.

I don't understand why the below url does not work anymore.  There are no errors in 
the log files.

http://mysolr.www.com:8181/solr/elevate?q=games&wt=xml&sort=score+desc&fl=id,bundle_name


Any help on this topic is appreciated.


Thanks



Re: Error with distributed search and Suggester component (Solr 3.4)

2012-05-02 Thread Ken Krugler
Hi Robert,

On May 1, 2012, at 7:07pm, Robert Muir wrote:

> On Tue, May 1, 2012 at 6:48 PM, Ken Krugler  
> wrote:
>> Hi list,
>> 
>> Does anybody know if the Suggester component is designed to work with shards?
> 
> I'm not really sure it is? They would probably have to override the
> default merge implementation specified by SpellChecker.

What confuses me is that Suggester says it's based on SpellChecker, which 
supposedly does work with shards.

> But, all of the current suggesters pump out over 100,000 QPS on my
> machine, so I'm wondering what the usefulness of this is?
> 
> And if it was useful, merging results from different machines is
> pretty inefficient, for suggest you would shard by term instead so
> that you need only contact a single host?

The issue is that I've got a configuration with 8 shards already that I'm 
trying to leverage for auto-complete.

My quick & dirty work-around would be to add a custom response handler that 
wraps the suggester, and returns results with the fields that the SearchHandler 
needs to do the merge.

-- Ken

--
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr






Re: question about dates

2012-05-02 Thread Jack Krupansky
That wasn't right either... the query must have the trailing Z, which Solr 
will strip off to match the indexed value, which doesn't have the Z. So, my 
corrected original statement is:


The trailing "Z" is required in your input data to be indexed, but the Z is 
not actually indexed by Solr (it is stripped), although the stored value of 
the field, if any, would have the original value with the Z. Your query must 
have the trailing "Z" though (which Solr will strip off), unless you are 
doing a wildcard or prefix query.


Sorry about that.

-- Jack Krupansky

-Original Message- 
From: Jack Krupansky

Sent: Wednesday, May 02, 2012 11:59 AM
To: solr-user@lucene.apache.org
Subject: Re: question about dates

Oops... I meant to say that Solr doesn't *index* the trailing Z, but it is
"stored" (the stored value, not the indexed value.) The query must match the
indexed value, not the stored value.

-- Jack Krupansky

-Original Message- 
From: Jack Krupansky

Sent: Wednesday, May 02, 2012 11:55 AM
To: solr-user@lucene.apache.org
Subject: Re: question about dates

The trailing "Z" is required in your input data to be indexed, but the Z is
not actually stored. Your query must have the trailing "Z" though, unless
you are doing a wildcard or prefix query.

-- Jack Krupansky

-Original Message- 
From: G.Long

Sent: Wednesday, May 02, 2012 11:18 AM
To: solr-user@lucene.apache.org
Subject: question about dates

Hi :)

I'm starting to use Solr and I'm facing a little problem with dates. My
documents have a date property which is of type 'yyyyMMdd'.

To index these dates, I use the following code:

String dateString = "20101230";
SimpleDateFormat sdf = new SimpleDateFormat("yyyyMMdd");
Date date = sdf.parse(dateString);
doc.addField("date", date);

In the index, the date "20101230" is saved as "2010-12-29T23:00:00Z"
(because of GMT).

Now I would like to query documents whose date property equals
"20101230", but I don't know how to handle this.

I tried the following code:

String dateString = "20101230";
SimpleDateFormat sdf = new SimpleDateFormat("yyyyMMdd");
Date date = sdf.parse(dateString);
SimpleDateFormat gmtSdf = new
SimpleDateFormat("yyyy-MM-dd'T'HH\\:mm\\:ss'Z'");
String gmtString = gmtSdf.format(date);

The problem is that gmtString equals "2010-12-30T00\:00\:00Z".
There is a difference between the indexed value and the parameter value
of my query. :/

I see that there might be something to do with the timezones during the
date-to-string and string-to-date conversions, but I can't find it.

Thanks,

Gary





Re: question about dates

2012-05-02 Thread Jack Krupansky
Oops... I meant to say that Solr doesn't *index* the trailing Z, but it is 
"stored" (the stored value, not the indexed value.) The query must match the 
indexed value, not the stored value.


-- Jack Krupansky

-Original Message- 
From: Jack Krupansky

Sent: Wednesday, May 02, 2012 11:55 AM
To: solr-user@lucene.apache.org
Subject: Re: question about dates

The trailing "Z" is required in your input data to be indexed, but the Z is
not actually stored. Your query must have the trailing "Z" though, unless
you are doing a wildcard or prefix query.

-- Jack Krupansky

-Original Message- 
From: G.Long

Sent: Wednesday, May 02, 2012 11:18 AM
To: solr-user@lucene.apache.org
Subject: question about dates

Hi :)

I'm starting to use Solr and I'm facing a little problem with dates. My
documents have a date property which is of type 'yyyyMMdd'.

To index these dates, I use the following code:

String dateString = "20101230";
SimpleDateFormat sdf = new SimpleDateFormat("yyyyMMdd");
Date date = sdf.parse(dateString);
doc.addField("date", date);

In the index, the date "20101230" is saved as "2010-12-29T23:00:00Z"
(because of GMT).

Now I would like to query documents whose date property equals
"20101230", but I don't know how to handle this.

I tried the following code:

String dateString = "20101230";
SimpleDateFormat sdf = new SimpleDateFormat("yyyyMMdd");
Date date = sdf.parse(dateString);
SimpleDateFormat gmtSdf = new
SimpleDateFormat("yyyy-MM-dd'T'HH\\:mm\\:ss'Z'");
String gmtString = gmtSdf.format(date);

The problem is that gmtString equals "2010-12-30T00\:00\:00Z".
There is a difference between the indexed value and the parameter value
of my query. :/

I see that there might be something to do with the timezones during the
date-to-string and string-to-date conversions, but I can't find it.

Thanks,

Gary






SOLRJ: Is there a way to obtain a quick count of total results for a query

2012-05-02 Thread vybe3142
I can achieve this by building a query with start=0 and rows=0, and using
.getResults().getNumFound().

Are there any more efficient approaches to this?

Thanks

--
View this message in context: 
http://lucene.472066.n3.nabble.com/SOLRJ-Is-there-a-way-to-obtain-a-quick-count-of-total-results-for-a-query-tp3955322.html
Sent from the Solr - User mailing list archive at Nabble.com.
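
For reference, a minimal SolrJ 3.x sketch of the rows=0 approach described
above (the server URL is illustrative); fetching zero rows keeps the
response to little more than the header and numFound:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class CountOnly {
    public static void main(String[] args) throws Exception {
        // Illustrative URL; point this at your own core.
        CommonsHttpSolrServer server =
            new CommonsHttpSolrServer("http://localhost:8983/solr");

        SolrQuery query = new SolrQuery("*:*");
        query.setRows(0); // no documents are fetched, only the total count
        QueryResponse rsp = server.query(query);
        System.out.println("numFound: " + rsp.getResults().getNumFound());
    }
}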


Re: question about dates

2012-05-02 Thread Jack Krupansky
The trailing "Z" is required in your input data to be indexed, but the Z is 
not actually stored. Your query must have the trailing "Z" though, unless 
you are doing a wildcard or prefix query.


-- Jack Krupansky

-Original Message- 
From: G.Long

Sent: Wednesday, May 02, 2012 11:18 AM
To: solr-user@lucene.apache.org
Subject: question about dates

Hi :)

I'm starting to use Solr and I'm facing a little problem with dates. My
documents have a date property which is of type 'yyyyMMdd'.

To index these dates, I use the following code:

String dateString = "20101230";
SimpleDateFormat sdf = new SimpleDateFormat("yyyyMMdd");
Date date = sdf.parse(dateString);
doc.addField("date", date);

In the index, the date "20101230" is saved as "2010-12-29T23:00:00Z"
(because of GMT).

Now I would like to query documents whose date property equals
"20101230", but I don't know how to handle this.

I tried the following code:

String dateString = "20101230";
SimpleDateFormat sdf = new SimpleDateFormat("yyyyMMdd");
Date date = sdf.parse(dateString);
SimpleDateFormat gmtSdf = new
SimpleDateFormat("yyyy-MM-dd'T'HH\\:mm\\:ss'Z'");
String gmtString = gmtSdf.format(date);

The problem is that gmtString equals "2010-12-30T00\:00\:00Z".
There is a difference between the indexed value and the parameter value
of my query. :/

I see that there might be something to do with the timezones during the
date-to-string and string-to-date conversions, but I can't find it.

Thanks,

Gary







question about dates

2012-05-02 Thread G.Long

Hi :)

I'm starting to use Solr and I'm facing a little problem with dates. My
documents have a date property which is of type 'yyyyMMdd'.

To index these dates, I use the following code:

String dateString = "20101230";
SimpleDateFormat sdf = new SimpleDateFormat("yyyyMMdd");
Date date = sdf.parse(dateString);
doc.addField("date", date);

In the index, the date "20101230" is saved as "2010-12-29T23:00:00Z"
(because of GMT).

Now I would like to query documents whose date property equals
"20101230", but I don't know how to handle this.

I tried the following code:

String dateString = "20101230";
SimpleDateFormat sdf = new SimpleDateFormat("yyyyMMdd");
Date date = sdf.parse(dateString);
SimpleDateFormat gmtSdf = new
SimpleDateFormat("yyyy-MM-dd'T'HH\\:mm\\:ss'Z'");
String gmtString = gmtSdf.format(date);

The problem is that gmtString equals "2010-12-30T00\:00\:00Z".
There is a difference between the indexed value and the parameter value
of my query. :/

I see that there might be something to do with the timezones during the
date-to-string and string-to-date conversions, but I can't find it.


Thanks,

Gary
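
A minimal sketch of the usual fix for the timezone shift described above:
pin both formatters to UTC so the parsed and formatted values line up with
what Solr indexes (field and format taken from the question):

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class UtcDates {
    public static void main(String[] args) throws Exception {
        // Parse the yyyyMMdd value as UTC so no local-timezone shift occurs.
        SimpleDateFormat in = new SimpleDateFormat("yyyyMMdd");
        in.setTimeZone(TimeZone.getTimeZone("UTC"));
        Date date = in.parse("20101230");

        // Format it back in Solr's canonical date syntax, also as UTC.
        SimpleDateFormat out = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
        out.setTimeZone(TimeZone.getTimeZone("UTC"));
        // Prints 2010-12-30T00:00:00Z; escape the colons (\:) or quote the
        // value when embedding it in a Solr query string.
        System.out.println(out.format(date));
    }
}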







Re: ExtractRH: How to strip metadata

2012-05-02 Thread Joseph Hagerty
I do not. I commented out all of the copyFields provided in the default
schema.xml that ships with 3.5. My schema is rather minimal. Here is my
fields block, if this helps:

<fields>
  ...
</fields>


On Wed, May 2, 2012 at 10:59 AM, Jack Krupansky wrote:

> Check to see if you have a CopyField for a wildcard pattern that copies to
> "meta", which would copy all of the Tika-generated fields to "meta."
>
> -- Jack Krupansky
>
> -Original Message- From: Joseph Hagerty
> Sent: Wednesday, May 02, 2012 9:56 AM
> To: solr-user@lucene.apache.org
> Subject: ExtractRH: How to strip metadata
>
>
> Greetings Solr folk,
>
> How can I instruct the extract request handler to ignore metadata/headers
> etc. when it constructs the "content" of the document I send to it?
>
> For example, I created an MS Word document containing just the word
> "SEARCHWORD" and nothing else. However, when I ship this doc to my solr
> server, here's what's thrown in the index:
>
> 
> Last-Printed 2009-02-05T15:02:00Z Revision-Number 22 Comments
> stream_source_info myfile Last-Author Inigo Montoya Template Normal.dotm
> Page-Count 1 subject Application-Name Microsoft Macintosh Word Author Jesus
> Baggins Word-Count 2 xmpTPg:NPages 1 Edit-Time 1086 Creation-Date
> 2008-11-05T20:19:00Z stream_content_type application/octet-stream Character
> Count 14 stream_size 31232 stream_name /Applications/MAMP/tmp/php/phpHCIg7y
> Company Parkman Elastomers Pvt Ltd Content-Type application/msword Keywords
> Last-Save-Date 2012-05-01T18:55:00Z SEARCHWORD
> 
>
> All I want is the body of the document, in this case the word "SEARCHWORD."
>
> For further reference, here's my extraction handler:
>
> <requestHandler name="/update/extract" startup="lazy"
> class="solr.extraction.ExtractingRequestHandler">
>   <lst name="defaults">
>     <str name="fmap.content">meta</str>
>     <str name="lowernames">true</str>
>     <str name="uprefix">ignored_</str>
>   </lst>
> </requestHandler>
>
> (Ironically, "meta" is the field in the solr schema to which I'm attempting
> to extract the body of the document. Don't ask).
>
> Thanks in advance for any pointers you can provide me.
>
> --
> - Joe
>



-- 
- Joe


Re: ExtractRH: How to strip metadata

2012-05-02 Thread Jack Krupansky
Check to see if you have a CopyField for a wildcard pattern that copies to 
"meta", which would copy all of the Tika-generated fields to "meta."


-- Jack Krupansky

-Original Message- 
From: Joseph Hagerty

Sent: Wednesday, May 02, 2012 9:56 AM
To: solr-user@lucene.apache.org
Subject: ExtractRH: How to strip metadata

Greetings Solr folk,

How can I instruct the extract request handler to ignore metadata/headers
etc. when it constructs the "content" of the document I send to it?

For example, I created an MS Word document containing just the word
"SEARCHWORD" and nothing else. However, when I ship this doc to my solr
server, here's what's thrown in the index:


Last-Printed 2009-02-05T15:02:00Z Revision-Number 22 Comments
stream_source_info myfile Last-Author Inigo Montoya Template Normal.dotm
Page-Count 1 subject Application-Name Microsoft Macintosh Word Author Jesus
Baggins Word-Count 2 xmpTPg:NPages 1 Edit-Time 1086 Creation-Date
2008-11-05T20:19:00Z stream_content_type application/octet-stream Character
Count 14 stream_size 31232 stream_name /Applications/MAMP/tmp/php/phpHCIg7y
Company Parkman Elastomers Pvt Ltd Content-Type application/msword Keywords
Last-Save-Date 2012-05-01T18:55:00Z SEARCHWORD


All I want is the body of the document, in this case the word "SEARCHWORD."

For further reference, here's my extraction handler:

<requestHandler name="/update/extract" startup="lazy"
    class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="fmap.content">meta</str>
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>
  </lst>
</requestHandler>

(Ironically, "meta" is the field in the solr schema to which I'm attempting
to extract the body of the document. Don't ask).

Thanks in advance for any pointers you can provide me.

--
- Joe 



Re: Dumb question: Streaming collector /query results

2012-05-02 Thread vybe3142
In other words, as an alternative, what's the most efficient way to gain
access to all of the document IDs that match a query?

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Dumb-question-Streaming-collector-query-results-tp3955175p3955194.html
Sent from the Solr - User mailing list archive at Nabble.com.


Dumb question: Streaming collector /query results

2012-05-02 Thread vybe3142
I doubt if SOLR has this capability, given that it is based on a RESTful
architecture, but I wanted to ask in case I'm mistaken.

In Lucene, it is easier to gain a direct handle to the collector/scorer
and access all the results as they're collected (as opposed to the SOLR
query call that performs the same internally but returns only a subset of
results based on the specified number of results and offset from the first
result).

What are my options if I want to access results as they're generated?
My first thought would be to write a custom collector to handle the hits as
they're scored.
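
A minimal sketch of such a collector against the Lucene 3.x Collector API
(class name and output are illustrative):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;

// Streams each hit as it is scored instead of buffering a page of results.
public class StreamingCollector extends Collector {
    private Scorer scorer;
    private int docBase;

    @Override
    public void setScorer(Scorer scorer) {
        this.scorer = scorer;
    }

    @Override
    public void setNextReader(IndexReader reader, int docBase) {
        this.docBase = docBase;
    }

    @Override
    public void collect(int doc) throws IOException {
        // Handle the global doc id (and score) here as each hit arrives.
        System.out.println((docBase + doc) + " score=" + scorer.score());
    }

    @Override
    public boolean acceptsDocsOutOfOrder() {
        return true;
    }
}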

Thanks








--
View this message in context: 
http://lucene.472066.n3.nabble.com/Dumb-question-Streaming-collector-query-results-tp3955175.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr Merge during off peak times

2012-05-02 Thread Erick Erickson
Optimizing is much less important query-speed wise
than historically, essentially it's not recommended much
any more.

A significant effect of optimize _used_ to be purging
obsolete data (i.e. that from deleted docs) from the
index, but that is now done on merge.

There's no harm in optimizing during off-peak hours, and
combined with an appropriate merge policy it may make
indexing a little better (I'm thinking of avoiding some
of the massive merges here).

BTW, in 4.0, there's DocumentsWriterPerThread, which
merges in the background and pretty much removes
even this as a motivation for optimizing.

All that said, optimizing isn't _bad_, it's just often
unnecessary.
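
For what it's worth, a minimal SolrJ sketch of an off-peak optimize, e.g.
fired from cron (the master URL is illustrative):

import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class OffPeakOptimize {
    public static void main(String[] args) throws Exception {
        // Illustrative master URL; adjust host, port, and core path.
        CommonsHttpSolrServer server =
            new CommonsHttpSolrServer("http://master:8983/solr");
        // optimize() merges down to one segment and purges deleted docs.
        server.optimize();
    }
}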

Best
Erick

On Wed, May 2, 2012 at 9:29 AM, Prakashganesh, Prabhu
 wrote:
> Actually we are not thinking of a M/S setup
> We are planning to have x number of shards on N number of servers, each of the
> shards handling both indexing and searching
> The expected query volume is not that high, so don't think we would need to 
> replicate to slaves. We think each shard will be able to handle its share of 
> the indexing and searching. If we need to scale query capacity in future, 
> yeah probably need to do it by replicating each shard to its slaves
>
> I agree autoCommit settings would be good to set up appropriately
>
> Another question I had is pros/cons of optimising the index. We would be 
> purging old content every week and am thinking whether to run an index 
> optimise in the weekend after purging old data. Because we are going to be 
> continuously indexing data which would be mix of adds, updates, deletes, not 
> sure if the benefit of optimising would last long enough to be worth doing 
> it. Maybe setting a low mergeFactor would be good enough. Optimising makes 
> sense if the index is more static, perhaps? Thoughts?
>
> Thanks
> Prabhu
>
>
> -Original Message-
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: 02 May 2012 13:15
> To: solr-user@lucene.apache.org
> Subject: Re: Solr Merge during off peak times
>
> But again, with a master/slave setup merging should
> be relatively benign. And at 200M docs, having a M/S
> setup is probably indicated.
>
> Here's a good writeup of mergepolicy
> http://juanggrande.wordpress.com/2011/02/07/merge-policy-internals/
>
> If you're indexing and searching on a single machine, merging
> is much less important than how often you commit. If a M/S
> situation, then you're polling interval on the slave is important.
>
> I'd look at commit frequency long before I worried about merging,
> that's usually where people shoot themselves in the foot - by
> committing too often.
>
> Overall, your mergeFactor is probably less important than other
> parts of how you perform indexing/searching, but it does have
> some effect for sure...
>
> Best
> Erick
>
> On Wed, May 2, 2012 at 7:54 AM, Prakashganesh, Prabhu
>  wrote:
>> We have a fairly large scale system - about 200 million docs and fairly high 
>> indexing activity - about 300k docs per day with peak ingestion rates of 
>> about 20 docs per sec. I want to work out what a good mergeFactor setting 
>> would be by testing with different mergeFactor settings. I think the default 
>> of 10 might be high, I want to try with 5 and compare. Unless I know when a 
>> merge starts and finishes, it would be quite difficult to work out the 
>> impact of changing mergeFactor. I want to be able to measure how long merges 
>> take, run queries during the merge activity and see what the response times 
>> are etc..
>>
>> Thanks
>> Prabhu
>>
>> -Original Message-
>> From: Erick Erickson [mailto:erickerick...@gmail.com]
>> Sent: 02 May 2012 12:40
>> To: solr-user@lucene.apache.org
>> Subject: Re: Solr Merge during off peak times
>>
>> Why do you care? Merging is generally a background process, or are
>> you doing heavy indexing? In a master/slave setup,
>> it's usually not really relevant except that (with 3.x), massive merges
>> may temporarily stop indexing. Is that the problem?
>>
>> Look at the merge policys, there are configurations that make
>> this less painful.
>>
> In trunk, DocumentsWriterPerThread makes merges happen in the
>> background, which helps the long-pause-while-indexing problem.
>>
>> Best
>> Erick
>>
>> On Wed, May 2, 2012 at 7:22 AM, Prakashganesh, Prabhu
>>  wrote:
>>> Ok, thanks Otis
>>> Another question on merging
>>> What is the best way to monitor merging?
>>> Is there something in the log file that I can look for?
>>> It seems like I have to monitor the system resources - read/write IOPS 
>>> etc.. and work out when a merge happened
>>> It would be great if I can do it by looking at log files or in the admin 
>>> UI. Do you know if this can be done or if there is some tool for this?
>>>
>>> Thanks
>>> Prabhu
>>>
>>> -Original Message-
>>> From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com]
>>> Sent: 01 May 2012 15:12
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: Solr Merge during off peak times
>>>
>>> H

Re: Removing old documents

2012-05-02 Thread Bai Shen
Somehow I missed that there was a solrclean command.  Thanks.

On Tue, May 1, 2012 at 10:41 AM, Markus Jelsma
wrote:

> Nutch 1.4 has a separate tool to remove 404 and redirect documents from
> your
> index based on your CrawlDB. Trunk's SolrIndexer can add and remove
> documents
> in one run based on segment data.
>
> On Tuesday 01 May 2012 16:31:47 Bai Shen wrote:
> > I'm running Nutch, so it's updating the documents, but I'm wanting to
> > remove ones that are no longer available.  So in that case, there's no
> > update possible.
> >
> > On Tue, May 1, 2012 at 8:47 AM, mav.p...@holidaylettings.co.uk <
> >
> > mav.p...@holidaylettings.co.uk> wrote:
> > > Not sure if there is an automatic way but we do it via a delete query
> and
> > > where possible we update doc under same id to avoid deletes.
> > >
> > > On 01/05/2012 13:43, "Bai Shen"  wrote:
> > > >What is the best method to remove old documents?  Things that now
> > > >generate 404 errors, etc.
> > > >
> > > >Is there an automatic method or do I have to do it manually?
> > > >
> > > >Thanks.
>
> --
> Markus Jelsma - CTO - Openindex
>


ExtractRH: How to strip metadata

2012-05-02 Thread Joseph Hagerty
Greetings Solr folk,

How can I instruct the extract request handler to ignore metadata/headers
etc. when it constructs the "content" of the document I send to it?

For example, I created an MS Word document containing just the word
"SEARCHWORD" and nothing else. However, when I ship this doc to my solr
server, here's what's thrown in the index:


Last-Printed 2009-02-05T15:02:00Z Revision-Number 22 Comments
stream_source_info myfile Last-Author Inigo Montoya Template Normal.dotm
Page-Count 1 subject Application-Name Microsoft Macintosh Word Author Jesus
Baggins Word-Count 2 xmpTPg:NPages 1 Edit-Time 1086 Creation-Date
2008-11-05T20:19:00Z stream_content_type application/octet-stream Character
Count 14 stream_size 31232 stream_name /Applications/MAMP/tmp/php/phpHCIg7y
Company Parkman Elastomers Pvt Ltd Content-Type application/msword Keywords
Last-Save-Date 2012-05-01T18:55:00Z SEARCHWORD


All I want is the body of the document, in this case the word "SEARCHWORD."

For further reference, here's my extraction handler:

<requestHandler name="/update/extract" startup="lazy"
    class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="fmap.content">meta</str>
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>
  </lst>
</requestHandler>

(Ironically, "meta" is the field in the solr schema to which I'm attempting
to extract the body of the document. Don't ask).

Thanks in advance for any pointers you can provide me.

-- 
- Joe


RE: Solr Merge during off peak times

2012-05-02 Thread Prakashganesh, Prabhu
Actually we are not thinking of a M/S setup
We are planning to have x number of shards on N number of servers, each of the
shards handling both indexing and searching
The expected query volume is not that high, so don't think we would need to 
replicate to slaves. We think each shard will be able to handle its share of 
the indexing and searching. If we need to scale query capacity in future, yeah 
probably need to do it by replicating each shard to its slaves

I agree autoCommit settings would be good to set up appropriately

Another question I had is pros/cons of optimising the index. We would be 
purging old content every week and am thinking whether to run an index optimise 
in the weekend after purging old data. Because we are going to be continuously 
indexing data which would be mix of adds, updates, deletes, not sure if the 
benefit of optimising would last long enough to be worth doing it. Maybe 
setting a low mergeFactor would be good enough. Optimising makes sense if the 
index is more static, perhaps? Thoughts?

Thanks
Prabhu 


-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: 02 May 2012 13:15
To: solr-user@lucene.apache.org
Subject: Re: Solr Merge during off peak times

But again, with a master/slave setup merging should
be relatively benign. And at 200M docs, having a M/S
setup is probably indicated.

Here's a good writeup of mergepolicy
http://juanggrande.wordpress.com/2011/02/07/merge-policy-internals/

If you're indexing and searching on a single machine, merging
is much less important than how often you commit. If a M/S
situation, then you're polling interval on the slave is important.

I'd look at commit frequency long before I worried about merging,
that's usually where people shoot themselves in the foot - by
committing too often.

Overall, your mergeFactor is probably less important than other
parts of how you perform indexing/searching, but it does have
some effect for sure...

Best
Erick

On Wed, May 2, 2012 at 7:54 AM, Prakashganesh, Prabhu
 wrote:
> We have a fairly large scale system - about 200 million docs and fairly high 
> indexing activity - about 300k docs per day with peak ingestion rates of 
> about 20 docs per sec. I want to work out what a good mergeFactor setting 
> would be by testing with different mergeFactor settings. I think the default 
> of 10 might be high, I want to try with 5 and compare. Unless I know when a 
> merge starts and finishes, it would be quite difficult to work out the impact 
> of changing mergeFactor. I want to be able to measure how long merges take, 
> run queries during the merge activity and see what the response times are 
> etc..
>
> Thanks
> Prabhu
>
> -Original Message-
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: 02 May 2012 12:40
> To: solr-user@lucene.apache.org
> Subject: Re: Solr Merge during off peak times
>
> Why do you care? Merging is generally a background process, or are
> you doing heavy indexing? In a master/slave setup,
> it's usually not really relevant except that (with 3.x), massive merges
> may temporarily stop indexing. Is that the problem?
>
> Look at the merge policys, there are configurations that make
> this less painful.
>
> In trunk, DocumentsWriterPerThread makes merges happen in the
> background, which helps the long-pause-while-indexing problem.
>
> Best
> Erick
>
> On Wed, May 2, 2012 at 7:22 AM, Prakashganesh, Prabhu
>  wrote:
>> Ok, thanks Otis
>> Another question on merging
>> What is the best way to monitor merging?
>> Is there something in the log file that I can look for?
>> It seems like I have to monitor the system resources - read/write IOPS etc.. 
>> and work out when a merge happened
>> It would be great if I can do it by looking at log files or in the admin UI. 
>> Do you know if this can be done or if there is some tool for this?
>>
>> Thanks
>> Prabhu
>>
>> -Original Message-
>> From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com]
>> Sent: 01 May 2012 15:12
>> To: solr-user@lucene.apache.org
>> Subject: Re: Solr Merge during off peak times
>>
>> Hi Prabhu,
>>
>> I don't think such a merge policy exists, but it would be nice to have this 
>> option and I imagine it wouldn't be hard to write if you really just base 
>> the merge or no merge decision on the time of day (and maybe day of the 
>> week).
>>
>> Note that this should go into Lucene, not Solr, so if you decide to 
>> contribute your work, please 
>> see http://wiki.apache.org/lucene-java/HowToContribute
>>
>> Otis
>> 
>> Performance Monitoring for Solr - http://sematext.com/spm
>>
>>
>>
>>
>>>
>>> From: "Prakashganesh, Prabhu" 
>>>To: "solr-user@lucene.apache.org" 
>>>Sent: Tuesday, May 1, 2012 8:45 AM
>>>Subject: Solr Merge during off peak times
>>>
>>>Hi,
>>>  I would like to know if there is a way to configure index merge policy in 
>>>solr so that the merging happens during off peak hours. Can you please let 
>>>me kno

Re: Newbie question on sorting

2012-05-02 Thread Jacek
Erick, I'll do that. Thank you very much.

Regards,
Jacek

On Tue, May 1, 2012 at 7:19 AM, Erick Erickson wrote:

> The easiest way is to do that in the app. That is, return the top
> 10 to the app (by score) then re-order them there. There's nothing
> in Solr that I know of that does what you want out of the box.
>
> Best
> Erick
>
> On Mon, Apr 30, 2012 at 11:10 AM, Jacek  wrote:
> > Hello all,
> >
> > I'm facing this simple problem, yet impossible to resolve for me (I'm a
> > newbie in Solr).
> > I need to sort the results by score (it is simple, of course), but then
> > what I need is to take top 10 results, and re-order it (only those top 10
> > results) by a date field.
> > It's not the same as sort=score,creationdate
> >
> > Any suggestions will be greatly appreciated!
>
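
For the archives, a minimal sketch of the app-side re-ordering suggested
above, assuming a SolrJ client and a stored date field named creationdate:

import java.util.Collections;
import java.util.Comparator;
import java.util.Date;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

public class ReorderByDate {
    // Query with sort=score desc and rows=10, then pass the results here.
    public static void reorder(SolrDocumentList topTen) {
        Collections.sort(topTen, new Comparator<SolrDocument>() {
            public int compare(SolrDocument a, SolrDocument b) {
                Date da = (Date) a.getFieldValue("creationdate");
                Date db = (Date) b.getFieldValue("creationdate");
                return db.compareTo(da); // newest first
            }
        });
    }
}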


Null Pointer Exception in SOLR

2012-05-02 Thread mechravi25
Hi,


When I tried to remove data from the UI (which will in turn hit SOLR), the
whole application got stuck. When we took the log files of the UI, we
could see that this set of requests did not reach SOLR itself. In the SOLR
log file, we were able to find the following exception occurring at the same
time.


SEVERE: org.apache.solr.common.SolrException: null java.lang.NullPointerException

null java.lang.NullPointerException

request: http://solr/coreX/select   
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request
at org.apache.solr.handler.component.HttpCommComponent$1.call
at org.apache.solr.handler.component.HttpCommComponent$1.call
at java.util.concurrent.FutureTask$Sync.innerRun
at java.util.concurrent.FutureTask.run
at java.util.concurrent.Executors$RunnableAdapter.call
at java.util.concurrent.FutureTask$Sync.innerRun
at java.util.concurrent.FutureTask.run
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask
at java.util.concurrent.ThreadPoolExecutor$Worker.run
at java.lang.Thread.run


This situation persisted for another few hours. No one was able to perform
any operation with the application, and if anyone tried to perform any
action, it resulted in the above exception during that period. But the
situation resolved by itself after a few hours and it started working
normally again. Can you tell me if this situation was due to a deadlock
condition, or was it due to the CPU utilization going beyond 100%? If it
was due to a deadlock, then why did we not get any such messages in the log
files? Or is it due to some other problem? Am I missing anything? Can you
guide me on this?


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Null-Pointer-Exception-in-SOLR-tp3954952.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr Merge during off peak times

2012-05-02 Thread Erick Erickson
But again, with a master/slave setup merging should
be relatively benign. And at 200M docs, having a M/S
setup is probably indicated.

Here's a good writeup of mergepolicy
http://juanggrande.wordpress.com/2011/02/07/merge-policy-internals/

If you're indexing and searching on a single machine, merging
is much less important than how often you commit. If a M/S
situation, then you're polling interval on the slave is important.

I'd look at commit frequency long before I worried about merging,
that's usually where people shoot themselves in the foot - by
committing too often.

Overall, your mergeFactor is probably less important than other
parts of how you perform indexing/searching, but it does have
some effect for sure...

Best
Erick

On Wed, May 2, 2012 at 7:54 AM, Prakashganesh, Prabhu
 wrote:
> We have a fairly large scale system - about 200 million docs and fairly high 
> indexing activity - about 300k docs per day with peak ingestion rates of 
> about 20 docs per sec. I want to work out what a good mergeFactor setting 
> would be by testing with different mergeFactor settings. I think the default 
> of 10 might be high, I want to try with 5 and compare. Unless I know when a 
> merge starts and finishes, it would be quite difficult to work out the impact 
> of changing mergeFactor. I want to be able to measure how long merges take, 
> run queries during the merge activity and see what the response times are 
> etc..
>
> Thanks
> Prabhu
>
> -Original Message-
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: 02 May 2012 12:40
> To: solr-user@lucene.apache.org
> Subject: Re: Solr Merge during off peak times
>
> Why do you care? Merging is generally a background process, or are
> you doing heavy indexing? In a master/slave setup,
> it's usually not really relevant except that (with 3.x), massive merges
> may temporarily stop indexing. Is that the problem?
>
> Look at the merge policys, there are configurations that make
> this less painful.
>
> In trunk, DocumentsWriterPerThread makes merges happen in the
> background, which helps the long-pause-while-indexing problem.
>
> Best
> Erick
>
> On Wed, May 2, 2012 at 7:22 AM, Prakashganesh, Prabhu
>  wrote:
>> Ok, thanks Otis
>> Another question on merging
>> What is the best way to monitor merging?
>> Is there something in the log file that I can look for?
>> It seems like I have to monitor the system resources - read/write IOPS etc.. 
>> and work out when a merge happened
>> It would be great if I can do it by looking at log files or in the admin UI. 
>> Do you know if this can be done or if there is some tool for this?
>>
>> Thanks
>> Prabhu
>>
>> -Original Message-
>> From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com]
>> Sent: 01 May 2012 15:12
>> To: solr-user@lucene.apache.org
>> Subject: Re: Solr Merge during off peak times
>>
>> Hi Prabhu,
>>
>> I don't think such a merge policy exists, but it would be nice to have this 
>> option and I imagine it wouldn't be hard to write if you really just base 
>> the merge or no merge decision on the time of day (and maybe day of the 
>> week).
>>
>> Note that this should go into Lucene, not Solr, so if you decide to 
>> contribute your work, please 
>> see http://wiki.apache.org/lucene-java/HowToContribute
>>
>> Otis
>> 
>> Performance Monitoring for Solr - http://sematext.com/spm
>>
>>
>>
>>
>>>
>>> From: "Prakashganesh, Prabhu" 
>>>To: "solr-user@lucene.apache.org" 
>>>Sent: Tuesday, May 1, 2012 8:45 AM
>>>Subject: Solr Merge during off peak times
>>>
>>>Hi,
>>>  I would like to know if there is a way to configure index merge policy in 
>>>solr so that the merging happens during off peak hours. Can you please let 
>>>me know if such a merge policy configuration exists?
>>>
>>>Thanks
>>>Prabhu
>>>
>>>
>>>


RE: Solr Merge during off peak times

2012-05-02 Thread Prakashganesh, Prabhu
We have a fairly large scale system - about 200 million docs and fairly high 
indexing activity - about 300k docs per day with peak ingestion rates of about 
20 docs per sec. I want to work out what a good mergeFactor setting would be by 
testing with different mergeFactor settings. I think the default of 10 might be 
high, I want to try with 5 and compare. Unless I know when a merge starts and 
finishes, it would be quite difficult to work out the impact of changing 
mergeFactor. I want to be able to measure how long merges take, run queries 
during the merge activity and see what the response times are etc..

Thanks
Prabhu

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: 02 May 2012 12:40
To: solr-user@lucene.apache.org
Subject: Re: Solr Merge during off peak times

Why do you care? Merging is generally a background process, or are
you doing heavy indexing? In a master/slave setup,
it's usually not really relevant except that (with 3.x), massive merges
may temporarily stop indexing. Is that the problem?

Look at the merge policys, there are configurations that make
this less painful.

In trunk, DocumentsWriterPerThread makes merges happen in the
background, which helps the long-pause-while-indexing problem.

Best
Erick

On Wed, May 2, 2012 at 7:22 AM, Prakashganesh, Prabhu
 wrote:
> Ok, thanks Otis
> Another question on merging
> What is the best way to monitor merging?
> Is there something in the log file that I can look for?
> It seems like I have to monitor the system resources - read/write IOPS etc.. 
> and work out when a merge happened
> It would be great if I can do it by looking at log files or in the admin UI. 
> Do you know if this can be done or if there is some tool for this?
>
> Thanks
> Prabhu
>
> -Original Message-
> From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com]
> Sent: 01 May 2012 15:12
> To: solr-user@lucene.apache.org
> Subject: Re: Solr Merge during off peak times
>
> Hi Prabhu,
>
> I don't think such a merge policy exists, but it would be nice to have this 
> option and I imagine it wouldn't be hard to write if you really just base the 
> merge or no merge decision on the time of day (and maybe day of the week).
>
> Note that this should go into Lucene, not Solr, so if you decide to 
> contribute your work, please 
> see http://wiki.apache.org/lucene-java/HowToContribute
>
> Otis
> 
> Performance Monitoring for Solr - http://sematext.com/spm
>
>
>
>
>>
>> From: "Prakashganesh, Prabhu" 
>>To: "solr-user@lucene.apache.org" 
>>Sent: Tuesday, May 1, 2012 8:45 AM
>>Subject: Solr Merge during off peak times
>>
>>Hi,
>>  I would like to know if there is a way to configure index merge policy in 
>>solr so that the merging happens during off peak hours. Can you please let me 
>>know if such a merge policy configuration exists?
>>
>>Thanks
>>Prabhu
>>
>>
>>


Re: Lucene FieldCache - Out of memory exception

2012-05-02 Thread Jack Krupansky
The FieldCache gets populated the first time a given field is referenced as 
a facet and then will stay around forever. So, as additional queries get 
executed with different facet fields, the number of FieldCache entries will 
grow.


If I understand what you have said, these faceted queries do work
initially, but after a while they stop working with OOM, correct?


The size of a single FieldCache depends on the field type. Since you are 
using dynamic fields, it depends on your "dynamicField" types - which you 
have not told us about. From your query I see that your fields start with 
"S_" and "F_" - presumably you have dynamic field types "S_*" and "F_*"? Are 
they strings, integers, floats, or what?


Each FieldCache will be an array with maxdoc entries (your total number of 
documents - 1.4 million) times the size of the field value or whatever a 
string reference is in your JVM.


String fields will take more space than numeric fields for the FieldCache, 
since a separate table is maintained for the unique terms in that field. 
Roughly what is the typical or average length of one of your facet field 
values? And, on average, how many unique terms are there within a typical 
faceted field?


If you can convert many of these faceted fields to simple integers the size 
should go down dramatically, but that depends on your application.


3 GB sounds like it might not be enough for such heavy use of faceting. It 
is probably not the 50-70 number, but the 440 or accumulated number across 
many queries that pushes the memory usage up.
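
To put rough numbers on the above, a back-of-the-envelope sketch (the
per-entry size is an assumption, not a measured value):

public class FieldCacheEstimate {
    public static void main(String[] args) {
        long maxDoc = 1400000L; // documents in the index described above
        int facetFields = 440;  // fields enabled for faceting
        int bytesPerEntry = 8;  // assumed: one 64-bit reference per document
        long bytes = maxDoc * facetFields * bytesPerEntry;
        // Prints roughly 4700 MB, already past a 3 GB heap, and the
        // per-field unique-term tables come on top of this.
        System.out.println("~" + (bytes / (1024 * 1024)) + " MB");
    }
}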


When you hit OOM, what does the Solr admin stats display say for FieldCache?

-- Jack Krupansky

-Original Message- 
From: Rahul R

Sent: Wednesday, May 02, 2012 2:22 AM
To: solr-user@lucene.apache.org
Subject: Re: Lucene FieldCache - Out of memory exception

Here is one sample query that I picked up from the log file :

q=*%3A*&fq=Category%3A%223__107%22&fq=S_P1540477699%3A%22MICROCIRCUIT%2C+LINE+TRANSCEIVERS%22&rows=0&facet=true&facet.mincount=1&facet.limit=2&facet.field=S_C1503120369&facet.field=S_P1406389942&facet.field=S_P1430116878&facet.field=S_P1430116881&facet.field=S_P1406453552&facet.field=S_P1406451296&facet.field=S_P1406452465&facet.field=S_C2968809156&facet.field=S_P1406389980&facet.field=S_P1540477699&facet.field=S_P1406389982&facet.field=S_P1406389984&facet.field=S_P1406451284&facet.field=S_P1406389926&facet.field=S_P1424886581&facet.field=S_P2017662632&facet.field=F_P1946367021&facet.field=S_P1430116884&facet.field=S_P2017662620&facet.field=F_P1406451304&facet.field=F_P1406451306&facet.field=F_P1406451308&facet.field=S_P1500901421&facet.field=S_P1507138990&facet.field=I_P1406452433&facet.field=I_P1406453565&facet.field=I_P1406452463&facet.field=I_P1406453573&facet.field=I_P1406451324&facet.field=I_P1406451288&facet.field=S_P1406451282&facet.field=S_P1406452471&facet.field=S_P1424886605&facet.field=S_P1946367015&facet.field=S_P1424886598&facet.field=S_P1946367018&facet.field=S_P1406453556&facet.field=S_P1406389932&facet.field=S_P2017662623&facet.field=S_P1406450978&facet.field=F_P1406452455&facet.field=S_P1406389972&facet.field=S_P1406389974&facet.field=S_P1406389986&facet.field=F_P1946367027&facet.field=F_P1406451294&facet.field=F_P1406451286&facet.field=F_P1406451328&facet.field=S_P1424886593&facet.field=S_P1406453567&facet.field=S_P2017662629&facet.field=S_P1406453571&facet.field=F_P1946367030&facet.field=S_P1406453569&facet.field=S_P2017662626&facet.field=S_P1406389978&facet.field=F_P1946367024

My primary question here is: can Solr handle this kind of query with so
many facet fields? I have tried using both enum and fc for facet.method and
there is no improvement with either.

Appreciate any help on this. Thank you.

- Rahul


On Mon, Apr 30, 2012 at 2:53 PM, Rahul R  wrote:


Hello,
I am using solr 1.3 with jdk 1.5.0_14 and weblogic 10MP1 application
server on Solaris. I use embedded solr server. More details :
Number of docs in solr index : 1.4 million
Physical size of index : 640MB
Total number of fields in the index : 700 (99% of these are dynamic 
fields)

Total number of fields enabled for faceting : 440
Avg number of facet fields participating in a faceted query : 50-70
Total RAM allocated to weblogic appserver : 3GB (max possible)

In a multi user environment with 3 users using this application for a
period of around 40 minutes, the application runs out of memory. Analysis
of the heap dump shows that almost 85% of the memory is retained by the
FieldCache. Now I understand that the field cache is out of our control, but
would appreciate some suggestions on how to handle this issue.

Some questions on this front :
- some mail threads on this forum seem to indicate that there could be
some connection between having dynamic fields and usage of FieldCache. Is
this true ? Most of the fields in my index are dynamic fields.
- as mentioned above, most of my faceted queries could have around 50-70
facet fields (I would do SolrQuery.addFacetFie

Re: Solr Merge during off peak times

2012-05-02 Thread Erick Erickson
Why do you care? Merging is generally a background process, or are
you doing heavy indexing? In a master/slave setup,
it's usually not really relevant except that (with 3.x), massive merges
may temporarily stop indexing. Is that the problem?

Look at the merge policys, there are configurations that make
this less painful.

In trunk, DocumentWriterPerThread makes merges happen in the
background, which helps the long-pause-while-indexing problem.

Best
Erick

On Wed, May 2, 2012 at 7:22 AM, Prakashganesh, Prabhu
 wrote:
> Ok, thanks Otis
> Another question on merging
> What is the best way to monitor merging?
> Is there something in the log file that I can look for?
> It seems like I have to monitor the system resources - read/write IOPS etc.. 
> and work out when a merge happened
> It would be great if I can do it by looking at log files or in the admin UI. 
> Do you know if this can be done or if there is some tool for this?
>
> Thanks
> Prabhu
>
> -Original Message-
> From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com]
> Sent: 01 May 2012 15:12
> To: solr-user@lucene.apache.org
> Subject: Re: Solr Merge during off peak times
>
> Hi Prabhu,
>
> I don't think such a merge policy exists, but it would be nice to have this 
> option and I imagine it wouldn't be hard to write if you really just base the 
> merge or no merge decision on the time of day (and maybe day of the week).
>
> Note that this should go into Lucene, not Solr, so if you decide to 
> contribute your work, please 
> see http://wiki.apache.org/lucene-java/HowToContribute
>
> Otis
> 
> Performance Monitoring for Solr - http://sematext.com/spm
>
>
>
>
>>
>> From: "Prakashganesh, Prabhu" 
>>To: "solr-user@lucene.apache.org" 
>>Sent: Tuesday, May 1, 2012 8:45 AM
>>Subject: Solr Merge during off peak times
>>
>>Hi,
>>  I would like to know if there is a way to configure index merge policy in 
>>solr so that the merging happens during off peak hours. Can you please let me 
>>know if such a merge policy configuration exists?
>>
>>Thanks
>>Prabhu
>>
>>
>>


Re: should slave replication be turned off / on during master clean and re-index?

2012-05-02 Thread Erick Erickson
Simply turn off replication during your rebuild-from-scratch. See:
http://wiki.apache.org/solr/SolrReplication#HTTP_API
the "disabelreplication" command.

The autocommit thing was, I think, in reference to keeping
a partial rebuild from being replicated.
Autocommit is usually a fine thing.

So your full-rebuild looks like this
1> disable replication on the master
2> rebuild the index (autocommit on or off, makes little difference as
far as replication)
3> enable replication on the master
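
A minimal Java sketch of steps 1 and 3 against the HTTP API linked above
(host and core path are illustrative):

import java.io.InputStream;
import java.net.URL;

public class ToggleReplication {
    // Illustrative master URL; adjust host, port, and core path.
    private static final String MASTER = "http://master:8983/solr/replication";

    static void send(String command) throws Exception {
        InputStream in = new URL(MASTER + "?command=" + command).openStream();
        in.close(); // fire the command; the response body is not needed here
    }

    public static void main(String[] args) throws Exception {
        send("disablereplication"); // 1> stop slaves from pulling
        // 2> rebuild the index here ...
        send("enablereplication");  // 3> let slaves replicate again
    }
}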

Best
Erick

On Tue, May 1, 2012 at 8:55 AM, geeky2  wrote:
> hello shawn,
>
> thanks for the reply.
>
> ok - i did some testing and yes you are correct.
>
> autocommit is doing the "commit" work in chunks. yes - the slaves are also
> going to go from having everything to nothing, then slowly building back up
> again, lagging behind the master.
>
> ... and yes - this is probably not what we need - as far as a replication
> strategy for the slaves.
>
> you said, you don't use autocommit.  if so - then why don't you use / like
> autocommit?
>
> since we have not done this here - there is no established reference point,
> from an operations perspective.
>
> i am looking to formulate some sort of operation strategy, so ANY ideas or
> input is really welcome.
>
>
>
> it seems to me that we have to account for two operational strategies -
>
> the first operational mode is a "daily" append to the solr core after the
> database tables have been updated.  this can probably be done with a simple
> delta import.  i would think that autocommit could remain on for the master
> and replication could also be left on so the slaves picked up the changes
> ASAP.  this seems like the mode that we would / should be in most of the
> time.
>
>
> the second operational mode would be a "build from scratch" mode, where
> changes in the schema necessitated a full re-index of the data.  given that
> our site (powered by solr) must be up all of the time, and that our full
> index time on the master (for the moment) is hovering somewhere around 16
> hours - it makes sense that some sort of parallel path - with a cut-over,
> must be used.
>
> in this situation is it possible to have the indexing process going on in
> the background - then have one commit at the end - then turn replication on
> for the slaves?
>
> are there disadvantages to this approach?
>
> also - i really like your suggestion of a "build core" and "live core".  is
> this the approach you use?
>
> thank you for all of the great input
>
>
>
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/should-slave-replication-be-turned-off-on-during-master-clean-and-re-index-tp3945531p3952904.html
> Sent from the Solr - User mailing list archive at Nabble.com.


RE: Solr Merge during off peak times

2012-05-02 Thread Prakashganesh, Prabhu
Ok, thanks Otis
Another question on merging
What is the best way to monitor merging?
Is there something in the log file that I can look for? 
It seems like I have to monitor the system resources - read/write IOPS etc.. 
and work out when a merge happened
It would be great if I can do it by looking at log files or in the admin UI. Do 
you know if this can be done or if there is some tool for this?

Thanks
Prabhu
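
One low-level option, for reference: Lucene 3.x can log merge activity
directly if you attach an info stream to the IndexWriter; each flush and
merge (start, segments involved, finish) is written out with timestamps,
and Solr exposes the same mechanism through the infoStream setting in
solrconfig.xml, if memory serves. A minimal sketch (paths are illustrative):

import java.io.File;
import java.io.PrintStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class MergeLogging {
    public static void main(String[] args) throws Exception {
        IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_35,
                new StandardAnalyzer(Version.LUCENE_35));
        IndexWriter writer = new IndexWriter(
                FSDirectory.open(new File("/path/to/index")), cfg);
        // Merge and flush activity is logged to this stream from now on.
        writer.setInfoStream(new PrintStream("merge-debug.log"));
        // ... index as usual ...
        writer.close();
    }
}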

-Original Message-
From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] 
Sent: 01 May 2012 15:12
To: solr-user@lucene.apache.org
Subject: Re: Solr Merge during off peak times

Hi Prabhu,

I don't think such a merge policy exists, but it would be nice to have this 
option and I imagine it wouldn't be hard to write if you really just base the 
merge or no merge decision on the time of day (and maybe day of the week).

Note that this should go into Lucene, not Solr, so if you decide to contribute 
your work, please see http://wiki.apache.org/lucene-java/HowToContribute

Otis

Performance Monitoring for Solr - http://sematext.com/spm




>
> From: "Prakashganesh, Prabhu" 
>To: "solr-user@lucene.apache.org"  
>Sent: Tuesday, May 1, 2012 8:45 AM
>Subject: Solr Merge during off peak times
> 
>Hi,
>  I would like to know if there is a way to configure index merge policy in 
>solr so that the merging happens during off peak hours. Can you please let me 
>know if such a merge policy configuration exists?
>
>Thanks
>Prabhu
>
>
>


Re: Solr: extracting/indexing HTML via cURL

2012-05-02 Thread Lance Norskog
You can have two fields: one which is stripped, and another which
stores the original data. You can use <copyField> directives and make
the "stripped" field indexed but not stored, and the original field
stored but not indexed. You only have to upload the file once, and
only store the text once.

If you look in the default schema, you'll find a bunch of text fields
are all copied to "text" or "text_all", which is indexed but not
stored. This catch-all field is the default search field.

http://lucidworks.lucidimagination.com/display/solr/Copying+Fields


On Mon, Apr 30, 2012 at 2:06 PM, okayndc  wrote:
> Great, thank you for the input.  My understanding of HTMLStripCharFilter is
> that it strips HTML tags, which is not what I want ~ is this correct?  I
> want to keep the HTML tags intact.
>
> On Mon, Apr 30, 2012 at 11:55 AM, Jack Krupansky 
> wrote:
>
>> If by "extracting HTML content via cURL" you mean using SolrCell to parse
>> html files, this seems to make sense. The sequence is that regardless of
>> the file type, each file extraction "parser" will strip off all formatting
>> and produce a raw text stream. Office, PDF, and HTML files are all treated
>> the same in that way. Then, the unformatted text stream is sent through the
>> field type analyzers to be tokenized into terms that Lucene can index. The
>> input string to the field type analyzer is what gets stored for the field,
>> but this occurs after the extraction file parser has already removed
>> formatting.
>>
>> No way for the formatting to be preserved in that case, other than to go
>> back to the original input document before extraction parsing.
>>
>> If you really do want to preserve full HTML formatted text, you would need
>> to define a field whose field type uses the HTMLStripCharFilter and then
>> directly add documents that direct the raw HTML to that field.
>>
>> There may be some other way to hook into the update processing chain, but
>> that may be too much effort compared to the HTML strip filter.
>>
>> -- Jack Krupansky
>>
>> -Original Message- From: okayndc
>> Sent: Monday, April 30, 2012 10:07 AM
>> To: solr-user@lucene.apache.org
>> Subject: Solr: extracting/indexing HTML via cURL
>>
>>
>> Hello,
>>
>> Over the weekend I experimented with extracting HTML content via cURL and
>> just
>> wondering why the extraction/indexing process does not include the HTML
>> tags.
>> It seems as though the HTML tags either being ignored or stripped somewhere
>> in the pipeline.
>> If this is the case, is it possible to include the HTML tags, as I would
>> like to keep the
>> formatted HTML intact?
>>
>> Any help is greatly appreciated.
>>



-- 
Lance Norskog
goks...@gmail.com