Re: Unique key error while indexing pdf files

2013-07-01 Thread archit2112
Okay. Can you please suggest a way (with an example) of assigning this unique
key to a pdf file? Say, a unique number to each pdf file. How do I achieve
this?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Unique-key-error-while-indexing-pdf-files-tp4074314p4074592.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: How to re-index Solr & get term frequency within documents

2013-07-01 Thread Tony Mullins
I use Nutch as input datasource for my Solr.
So I cannot re-run all the Nutch jobs to generate data again for Solr as it
will take very long to generate that much data.

I was hoping there would be an easier way inside Solr to just re-index all
the existing data.

Thanks,
Tony


On Tue, Jul 2, 2013 at 1:37 AM, Jack Krupansky wrote:

> Or, go with a commercial product that has a single-click Solr re-index
> capability, such as:
>
> 1. DataStax Enterprise - data is stored in Cassandra and reindexed into
> Solr from there.
>
> 2. LucidWorks Search - data sources are declared so that the package can
> automatically re-crawl the data sources.
>
> But, yeah, as Otis says, "re-index" is really just a euphemism for
> deleting your Solr data directory and indexing from scratch from the
> original data sources.
>
> -- Jack Krupansky
>
> -Original Message- From: Otis Gospodnetic
> Sent: Monday, July 01, 2013 2:26 PM
> To: solr-user@lucene.apache.org
> Subject: Re: How to re-index Solr & get term frequency within documents
>
>
> If all your fields are stored, you can do it with
> http://search-lucene.com/?q=solrentityprocessor
>
> Otherwise, just reindex the same way you indexed in the first place.
> *Always* be ready to reindex from scratch.
>
> Otis
> --
> Solr & ElasticSearch Support -- http://sematext.com/
> Performance Monitoring -- http://sematext.com/spm
>
>
>
> On Mon, Jul 1, 2013 at 1:29 PM, Tony Mullins 
> wrote:
>
>> Thanks Jack , it worked.
>>
>> Could you please provide some info on how to re-index existing data in
>> Solr, after changing the schema.xml ?
>>
>> Thanks,
>> Tony
>>
>>
>> On Mon, Jul 1, 2013 at 8:21 PM, Jack Krupansky wrote:
>>
>>  You can write any function query in the field list of the "fl" parameter.
>>> Sounds like you want "termfreq":
>>>
>>> termfreq(field_arg,term)
>>>
>>> fl=id,a,b,c,termfreq(a,xyz)
>>>
>>>
>>> -- Jack Krupansky
>>>
>>> -Original Message- From: Tony Mullins
>>> Sent: Monday, July 01, 2013 10:47 AM
>>> To: solr-user@lucene.apache.org
>>> Subject: How to re-index Solr & get term frequency within documents
>>>
>>>
>>> Hi,
>>>
>>> I am using Solr 4.3.0.
>>> If I change my solr's schema.xml then do I need to re-index my solr ? And
>>> if yes , how to ?
>>>
>>> My 2nd question is I need to find the frequency of term per document in
>>> all
>>> documents of search result.
>>>
>>> My field is
>>>
>>> >> multiValued="true" termVectors="true" termPositions="true"
>>> termOffsets="true"/>
>>>
>>> And I am trying this query
>>>
>>> http://localhost:8080/solr/select/?q=iphone&fl=AuthorX%2CTitleX%2CCommentX&df=CommentX&wt=xml&indent=true&qt=tvrh&tv=true&tv.tf=true&tv.df=true&tv.positions&tv.offsets=true
>>>
>>> Its just returning me the result set, no info on my searched term's
>>> (iphone) frequency in each document.
>>>
>>> How can I make Solr to return the frequency of searched term per document
>>> in result set ?
>>>
>>> Thanks,
>>> Tony.
>>>
>>>
>


Re: Unique key error while indexing pdf files

2013-07-01 Thread archit2112
Can you please suggest a way (with example) of assigning this unique key to a
pdf file?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Unique-key-error-while-indexing-pdf-files-tp4074314p4074588.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Schema design for parent child field

2013-07-01 Thread Mikhail Khludnev
From my experience, deeply nested scopes are almost exclusively a case for SOLR-3076.


On Sat, Jun 29, 2013 at 1:08 PM, Sperrink
wrote:

> Good day,
> I'm seeking some guidance on how best to represent the following data
> within
> a solr schema.
> I have a list of subjects which are detailed to n levels.
> Each document can contain many of these subject entities.
> As I see it if this had been just 1 subject per document, dynamic fields
> would have been a good resolution.
> Any suggestions on how best to create this structure in a denormalised
> fashion while maintaining the data integrity.
> For example a document could have:
> Subject level 1: contract
> Subject level 2: claims
> Subject level 1: patent
> Subject level 2: counter claims
>
> If I were to search for level 1 contract, I would only want the facet count
> for level 2 to contain claims and not counter claims.
>
> Any assistance in this would be much appreciated.
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Schema-design-for-parent-child-field-tp4074084.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics


 


Re: Converting nested data model to solr schema

2013-07-01 Thread Mikhail Khludnev
On Mon, Jul 1, 2013 at 5:56 PM, adfel70  wrote:

> This requires me to override the solr document distribution mechanism.
> I fear that with this solution I may loose some of solr cloud's
> capabilities.
>

It's not clear whether you are aware of
http://searchhub.org/2013/06/13/solr-cloud-document-routing/ , but what you
did doesn't sound scary to me. If it works, it should be fine. I'm not
aware of any capabilities that you are going to lose.
Obviously SOLR-3076 provides astonishing query-time performance by
offloading the actual join work to index time. Check it out if your current
approach turns slow.
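
(For a rough illustration of what that article covers: with the composite-id
router you encode the routing key into the document id itself, e.g. ids of the
form "parentKey!childId", so that related documents land on the same shard.
The key names here are made up.)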


-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics


 


Re: dataconfig to index ZIP Files

2013-07-01 Thread ericrs22
not sure if this will help any.

Here's the verbose log 

INFO  - 2013-07-01 23:17:08.632;
org.apache.solr.handler.dataimport.DataImporter; Loading DIH Configuration:
tika-data-config.xml
INFO  - 2013-07-01 23:17:08.648;
org.apache.solr.handler.dataimport.DataImporter; Data Configuration loaded
successfully
INFO  - 2013-07-01 23:17:08.663; org.apache.solr.core.SolrCore; [tika]
webapp=/solr path=/dataimport
params={optimize=false&clean=false&indent=true&commit=false&verbose=true&entity=Archive&command=full-import&debug=false&wt=json}
status=0 QTime=31 
INFO  - 2013-07-01 23:17:08.663;
org.apache.solr.handler.dataimport.DataImporter; Starting Full Import
INFO  - 2013-07-01 23:17:08.679; org.apache.solr.core.SolrCore; [tika]
webapp=/solr path=/dataimport
params={indent=true&command=status&_=1372720628679&wt=json} status=0 QTime=0 
INFO  - 2013-07-01 23:17:08.679;
org.apache.solr.handler.dataimport.SimplePropertiesWriter; Read
dataimport.properties
INFO  - 2013-07-01 23:17:09.552; org.apache.solr.core.SolrCore; [tika]
webapp=/solr path=/dataimport
params={indent=true&command=status&_=1372720629552&wt=json} status=0 QTime=0 
INFO  - 2013-07-01 23:17:11.580; org.apache.solr.core.SolrCore; [tika]
webapp=/solr path=/dataimport
params={indent=true&command=status&_=1372720631577&wt=json} status=0 QTime=0 
INFO  - 2013-07-01 23:17:13.593; org.apache.solr.core.SolrCore; [tika]
webapp=/solr path=/dataimport
params={indent=true&command=status&_=1372720633593&wt=json} status=0 QTime=0 
INFO  - 2013-07-01 23:17:15.247;
org.apache.solr.handler.dataimport.DocBuilder; Time taken = 0:0:6.553
INFO  - 2013-07-01 23:17:15.247;
org.apache.solr.update.processor.LogUpdateProcessor; [tika] webapp=/solr
path=/dataimport
params={optimize=false&clean=false&indent=true&commit=false&verbose=true&entity=Archive&command=full-import&debug=false&wt=json}
status=0 QTime=31 {} 0 31
INFO  - 2013-07-01 23:17:15.621; org.apache.solr.core.SolrCore; [tika]
webapp=/solr path=/dataimport
params={indent=true&command=status&_=1372720635621&wt=json} status=0 QTime=0 
INFO  - 2013-07-01 23:17:17.259; org.apache.solr.core.SolrCore; [tika]
webapp=/solr path=/dataimport
params={indent=true&command=status&_=1372720637256&wt=json} status=0 QTime=0 
INFO  - 2013-07-01 23:17:17.649; org.apache.solr.core.SolrCore; [tika]
webapp=/solr path=/dataimport
params={indent=true&command=status&_=1372720637645&wt=json} status=0 QTime=0 




--
View this message in context: 
http://lucene.472066.n3.nabble.com/dataconfig-to-index-ZIP-Files-tp4073965p4074498.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Disable Document Id from being printed in the logs...

2013-07-01 Thread Shawn Heisey

On 7/1/2013 3:24 PM, Niran Fajemisin wrote:

I noticed that for Solr 4.2, when an internal call is made between two nodes 
Solr uses the list of matching document ids to fetch the document details. At 
this time, it prints out all matching document ids as a part of the query. Is 
there a way to suppress these log statements from being created?


There's no way for Solr to distinguish between requests made by another 
Solr core and requests made by "real" clients.  Paying attention to the 
IP address where the request originated won't work either - a lot of 
Solr installations run on the same hardware as the web server or other 
application that *uses* Solr.


Debugging a problem becomes very difficult if you come up with *ANY* way 
to stop logging these requests.  That said, on newer versions the 
parameter 'distrib=false' should be included on those requests that you 
don't want to log, so an option to turn off logging of non-distributed 
requests might be a reasonable idea.  I think you'll run into some 
resistance, but as long as it doesn't default to enabled, it might be 
something that could be added.


If you are worried about performance, update the logging configuration 
so that Solr only logs at WARN, that way no requests will be logged.  If 
you then need to debug, you can change the logging to INFO using the 
admin UI, get your debugging done, and then turn it back down to WARN. 
This is the best logging approach from a performance perspective.
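
With the example log4j setup, that usually just means raising the root level in 
log4j.properties, e.g. something like log4j.rootLogger=WARN, file (a sketch - the 
appender names depend on your own properties file).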


Thanks,
Shawn



Re: full-import failed after 5 hours with Exception: ORA-01555: snapshot too old: rollback segment number with name "" too small ORA-22924: snapshot too old

2013-07-01 Thread Michael Della Bitta
I would say definitely investigate the performance of the query, but also
since you're using CachedSqlEntityProcessor, you might want to back off on
the transaction isolation to READ_COMMITTED, which I think is the lowest
one that Oracle supports:

http://wiki.apache.org/solr/DataImportHandler#Configuring_JdbcDataSource
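
Roughly, that would be an attribute on the dataSource element in your DIH config, 
something like the sketch below (driver, url and credentials are placeholders; see 
the wiki page above for the exact attribute):

<dataSource type="JdbcDataSource"
            driver="oracle.jdbc.OracleDriver"
            url="jdbc:oracle:thin:@dbhost:1521:SID"
            user="user" password="pass"
            batchSize="500"
            readOnly="true"
            transactionIsolation="TRANSACTION_READ_COMMITTED"/>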

Michael Della Bitta

Applications Developer

o: +1 646 532 3062  | c: +1 917 477 7906

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions  | g+:
plus.google.com/appinions
w: appinions.com 


On Fri, Jun 28, 2013 at 2:52 PM, Otis Gospodnetic <
otis.gospodne...@gmail.com> wrote:

> Hi,
>
> I'd go talk to the DBA.  How long does this query take if you run it
> directly against Oracle?  How long if you run it locally vs. from a
> remote server (like Solr is in relation to your Oracle server(s))?
> What happens if you increase batchSize?
>
> Otis
> --
> Solr & ElasticSearch Support -- http://sematext.com/
> Performance Monitoring -- http://sematext.com/spm
>
>
>
> On Thu, Jun 27, 2013 at 6:41 PM, srinalluri 
> wrote:
> > Hello,
> >
> > I am using Solr 4.3.2 and Oracle DB. The sub entity is using
> > CachedSqlEntityProcessor. The dataSource has batchSize="500". The
> > full-import failed with the 'ORA-01555: snapshot too old: rollback segment
> > number with name "" too small ORA-22924: snapshot too old' exception after
> > 5 hours.
> >
> > We already increased the undo space 4 times at the database end. The number of
> > records in the jan_story table is only 800,000. Tomcat runs with 4GB of JVM
> > memory.
> >
> > Following is the entity (there are other sub-entities, I didn't mention
> them
> > here. As the import failed with article_details entity. article_details
> is
> > the first sub-entity)
> >
> >  > preImportDeleteQuery="content_type:article AND
> > repository:par8qatestingprod"
> > query="select ID as VCMID from jan_story">
> >  > transformer="TemplateTransformer,ClobTransformer,RegexTransformer"
> >   query="select bb.recordid, aa.ID as DID,aa.STORY_TITLE,
> > aa.STORY_HEADLINE, aa.SOURCE, aa.DECK, regexp_replace(aa.body,
> > '\\[(pullquote|summary)\]\|\[video [0-9]+?\]|\[youtube
> > .+?\]', '') as BODY, aa.PUBLISHED_DATE, aa.MODIFIED_DATE, aa.DATELINE,
> > aa.REPORTER_NAME, aa.TICKER_CODES,aa.ADVERTORIAL_CONTENT from jan_story
> > aa,mapp bb where aa.id=bb.keystring1" cacheKey="DID"
> > cacheLookup="par8-article-testingprod.VCMID"
> > processor="CachedSqlEntityProcessor" >
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> >   
> >   
> >
> >
> > The full-import without CachedSqlEntityProcessor is taking 7 days. That
> is
> > why I am doing all this.
> >
> >
> >
> > --
> > View this message in context:
> http://lucene.472066.n3.nabble.com/full-import-failed-after-5-hours-with-Exception-ORA-01555-snapshot-too-old-rollback-segment-number-wd-tp4073822.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
>


Disable Document Id from being printed in the logs...

2013-07-01 Thread Niran Fajemisin
Hi all,

I noticed that for Solr 4.2, when an internal call is made between two nodes 
Solr uses the list of matching document ids to fetch the document details. At 
this time, it prints out all matching document ids as a part of the query. Is 
there a way to suppress these log statements from being created?

Thanks.
Niran

Re: Improving performance to return 2000+ documents

2013-07-01 Thread Utkarsh Sengar
Thanks Erick/Jagdish.

Just to give some background on my queries.

1. All my queries are unique. A query can be: "ipod" and "ipod 8gb" (but
these are unique). These are about 1.2M in total.
So, I assume setting a high queryResultCache, queryResultWindowSize and
queryResultMaxDocsCached won't help.

2. I have this cache settings:

//My understanding is, documentCache will help me the most because solr
will cache documents retrieved.
//Stats for documentCache: http://apaste.info/hknh


//Default, since my queries are unique.


//Not sure how I can use filterCache, so I am keeping it as the default

true
100
100


I think the question can also be framed as: how can I optimize Solr
response time for a 50M product catalog with unique queries that retrieve
2000 documents in one go?
I looked at a Solr search component, but I think writing a "proxy" around Solr
was easier, so I went ahead with that approach.


Thanks,
-Utkarsh




On Sun, Jun 30, 2013 at 6:54 PM, Jagdish Nomula wrote:

> Solrconfig.xml has got entries which you can tweak for your use case. One
> of them is queryresultwindowsize. You can try using the value of 2000 and
> see if it helps improving performance. Please make sure you have enough
> memory allocated for queryresultcache.
> A combination of sharding and distribution of workload(requesting
> 2000/number of shards) with an aggregator would be a good way to maximize
> performance.
>
> Thanks,
>
> Jagdish
>
>
> On Sun, Jun 30, 2013 at 6:48 PM, Erick Erickson  >wrote:
>
> > 50M documents, depending on a bunch of things,
> > may not be unreasonable for a single node, only
> > testing will tell.
> >
> > But the question I have is whether you should be
> > using standard Solr queries for this or building a custom
> > component that goes at the base Lucene index
> > and "does the right thing". Or even re-indexing your
> > entire corpus periodically to add this kind of data.
> >
> > FWIW,
> > Erick
> >
> >
> > On Sun, Jun 30, 2013 at 2:00 PM, Utkarsh Sengar  > >wrote:
> >
> > > Thanks Erick/Peter.
> > >
> > > This is an offline process, used by a relevancy engine implemented
> around
> > > solr. The engine computes boost scores for related keywords based on
> > > clickstream data.
> > > i.e.: say clickstream has: ipad=upc1,upc2,upc3
> > > I query solr with keyword: "ipad" (to get 2000 documents) and then
> make 3
> > > individual queries for upc1,upc2,upc3 (which are fast).
> > > The data is then used to compute related keywords to "ipad" with their
> > > boost values.
> > >
> > > So, I cannot really replace that, since I need full text search over my
> > > dataset to retrieve top 2000 documents.
> > >
> > > I tried paging: I retrieve 500 solr documents 4 times (0-500,
> > 500-1000...),
> > > but don't see any improvements.
> > >
> > >
> > > Some questions:
> > > 1. Maybe the JVM size might help?
> > > This is what I see in the dashboard:
> > > Physical Memory 76.2%
> > > Swap Space NaN% (don't have any swap space, running on AWS EBS)
> > > File Descriptor Count 4.7%
> > > JVM-Memory 73.8%
> > >
> > > Screenshot: http://i.imgur.com/aegKzP6.png
> > >
> > > 2. Will reducing the shards from 3 to 1 improve performance? (maybe
> > > increase the RAM from 30 to 60GB) The problem I will face in that case
> > will
> > > be fitting 50M documents on 1 machine.
> > >
> > > Thanks,
> > > -Utkarsh
> > >
> > >
> > > On Sat, Jun 29, 2013 at 3:58 PM, Peter Sturge  > > >wrote:
> > >
> > > > Hello Utkarsh,
> > > > This may or may not be relevant for your use-case, but the way we
> deal
> > > with
> > > > this scenario is to retrieve the top N documents 5,10,20or100 at a
> time
> > > > (user selectable). We can then page the results, changing the start
> > > > parameter to return the next set. This allows us to 'retrieve'
> millions
> > > of
> > > > documents - we just do it at the user's leisure, rather than make
> them
> > > wait
> > > > for the whole lot in one go.
> > > > This works well because users very rarely want to see ALL 2000 (or
> > > whatever
> > > > number) documents at one time - it's simply too much to take in at
> one
> > > > time.
> > > > If your use-case involves an automated or offline procedure (e.g.
> > > running a
> > > > report or some data-mining op), then presumably it doesn't matter so
> > much
> > > > it takes a bit longer (as long as it returns in some reasonble time).
> > > > Have you looked at doing paging on the client-side - this will hugely
> > > > speed-up your search time.
> > > > HTH
> > > > Peter
> > > >
> > > >
> > > >
> > > > On Sat, Jun 29, 2013 at 6:17 PM, Erick Erickson <
> > erickerick...@gmail.com
> > > > >wrote:
> > > >
> > > > > Well, depending on how many docs get served
> > > > > from the cache the time will vary. But this is
> > > > > just ugly, if you can avoid this use-case it would
> > > > > be a Good Thing.
> > > > >
> > > > > Problem here is that each and every shard must
> > > > > assemble the list of 2,000 documents (just ID and
> > > > > sort criteria, usually sc

Using per-segment FieldCache or DocValues in custom component?

2013-07-01 Thread Michael Ryan
I have some custom code that uses the top-level FieldCache (e.g., 
FieldCache.DEFAULT.getLongs(reader, "foobar", false)). I'd like to redesign 
this to use the per-segment FieldCaches so that re-opening a Searcher is 
fast(er). In most cases, I've got a docId and I want to get the value for a 
particular single-valued field for that doc.

Is there a good place to look to see example code of per-segment FieldCache 
use? I've been looking at PerSegmentSingleValuedFaceting, but hoping there 
might be something less confusing :)

Also thinking DocValues might be a better way to go for me... is there any 
documentation or example code for that?

-Michael


Re: How to re-index Solr & get term frequency within documents

2013-07-01 Thread Jack Krupansky
Or, go with a commercial product that has a single-click Solr re-index 
capability, such as:


1. DataStax Enterprise - data is stored in Cassandra and reindexed into Solr 
from there.


2. LucidWorks Search - data sources are declared so that the package can 
automatically re-crawl the data sources.


But, yeah, as Otis says, "re-index" is really just a euphemism for deleting 
your Solr data directory and indexing from scratch from the original data 
sources.


-- Jack Krupansky

-Original Message- 
From: Otis Gospodnetic

Sent: Monday, July 01, 2013 2:26 PM
To: solr-user@lucene.apache.org
Subject: Re: How to re-index Solr & get term frequency within documents

If all your fields are stored, you can do it with
http://search-lucene.com/?q=solrentityprocessor

Otherwise, just reindex the same way you indexed in the first place.
*Always* be ready to reindex from scratch.

Otis
--
Solr & ElasticSearch Support -- http://sematext.com/
Performance Monitoring -- http://sematext.com/spm



On Mon, Jul 1, 2013 at 1:29 PM, Tony Mullins  
wrote:

Thanks Jack , it worked.

Could you please provide some info on how to re-index existing data in
Solr, after changing the schema.xml ?

Thanks,
Tony


On Mon, Jul 1, 2013 at 8:21 PM, Jack Krupansky 
wrote:



You can write any function query in the field list of the "fl" parameter.
Sounds like you want "termfreq":

termfreq(field_arg,term)

fl=id,a,b,c,termfreq(a,xyz)


-- Jack Krupansky

-Original Message- From: Tony Mullins
Sent: Monday, July 01, 2013 10:47 AM
To: solr-user@lucene.apache.org
Subject: How to re-index Solr & get term frequency within documents


Hi,

I am using Solr 4.3.0.
If I change my solr's schema.xml then do I need to re-index my solr ? And
if yes , how to ?

My 2nd question is I need to find the frequency of term per document in 
all

documents of search result.

My field is



And I am trying this query

http://localhost:8080/solr/select/?q=iphone&fl=AuthorX%2CTitleX%2CCommentX&df=CommentX&wt=xml&indent=true&qt=tvrh&tv=true&tv.tf=true&tv.df=true&tv.positions&tv.offsets=true

Its just returning me the result set, no info on my searched term's
(iphone) frequency in each document.

How can I make Solr to return the frequency of searched term per document
in result set ?

Thanks,
Tony.





Re: are fields stored or unstored by default xml

2013-07-01 Thread Jack Krupansky
Correct - the field definitions inherit the attributes of the field type, 
and it is the field type that has the actual default values for indexed and 
stored (and other attributes.)


-- Jack Krupansky

-Original Message- 
From: Yonik Seeley

Sent: Monday, July 01, 2013 3:56 PM
To: solr-user@lucene.apache.org
Subject: Re: are fields stored or unstored by default xml

On Mon, Jul 1, 2013 at 3:50 PM, Jack Krupansky  
wrote:

"stored" and "indexed" both default to "true".

This is legal:

   


Actually, for fields I believe the defaults come from the fieldType.
The fieldType defaults to true for both indexed and stored if they are
not specified there.

-Yonik
http://lucidworks.com 



Re: "Classic" 4.2 master-slave replication not completing

2013-07-01 Thread Shawn Heisey

On 7/1/2013 1:07 PM, Neal Ensor wrote:

is it conceivable that there's too much traffic, causing Solr to stall
re-opening the searcher (thus releasing to the new index)?  I'm grasping at
straws, and this is beginning to bug me a lot.  The traffic logs wouldn't
seem to support this (apart from periodic health-check pings, the load is
distributed fairly evenly across 3 slaves by a load-balancer tool).  After
35+ minutes this morning, none of the three successfully "unstuck", and had
to be manually core-reloaded.

Is there perhaps a configuration element I'm overlooking that might make
solr a bit less "friendly" about it, and just dump the searchers/reopen
when replication completes?


Can you share your solrconfig.xml file, someplace like 
http://apaste.info?  Please be sure to choose the correct file type ... 
on that website it is (X)HTML for an XML file.



As a side note, I'm getting really frustrated with trying to get log4j
logging on 4.3.1 set up; my tomcat container persists in complaining that
it cannot find log4j.properties, when I've put it in the WEB-INF/classes of
the war file, have SLF4j libraries AND log4j at the shared container "lib"
level, and log4j.debug turned on.  I can't find any excuses why it cannot
seem to locate the configuration.


The wiki is still down for maintenance, so below is a relevant section 
of the SolrLogging wiki page extracted from Google Cache.  When it comes 
back up, you can find it at this URL:


http://wiki.apache.org/solr/SolrLogging#Switching_from_Log4J_back_to_JUL_.28java.util.logging.29

=
The example logging setup takes over the configuration of Solr logging, 
which prevents the container from controlling where logs go. Users of 
containers other than the included Jetty (Tomcat in particular) may be 
accustomed to doing the logging configuration in the container. If you 
want to switch back to java.util.logging so this is once again possible, 
here's what to do. These steps apply to the example/lib/ext directory in 
the Solr example, or to your container's lib directory as mentioned in 
the previous section. These steps also assume that the slf4j version is 
1.6.6, which comes with Solr4.3. Newer versions may use a different 
slf4j version. As of May 2013, you can use a newer SLF4J version with no 
trouble, but be aware that all slf4j components in your classpath must 
be the same version.


Download slf4j version 1.6.6 (the version used in Solr4.3.x). 
http://www.slf4j.org/dist/slf4j-1.6.6.zip

Unpack the slf4j archive.
Delete these JARs from your lib folder: slf4j-log4j12-1.6.6.jar, 
jul-to-slf4j-1.6.6.jar, log4j-1.2.16.jar
Add these JARs to your lib folder (from slf4j zip): 
slf4j-jdk14-1.6.6.jar, log4j-over-slf4j-1.6.6.jar

Use your old logging.properties
=

Thanks,
Shawn



Re: are fields stored or unstored by default xml

2013-07-01 Thread Yonik Seeley
On Mon, Jul 1, 2013 at 3:50 PM, Jack Krupansky  wrote:
> "stored" and "indexed" both default to "true".
>
> This is legal:
>
>

Actually, for fields I believe the defaults come from the fieldType.
The fieldType defaults to true for both indexed and stored if they are
not specified there.

-Yonik
http://lucidworks.com


Re: are fields stored or unstored by default xml

2013-07-01 Thread Jack Krupansky

"stored" and "indexed" both default to "true".

This is legal:
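
(The field element in the original message was stripped by the archive; a minimal 
declaration of the kind being described - name and type only, with no indexed or 
stored attributes - would be something like the following, with an illustrative 
name and type:)

<field name="title" type="text_general"/>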

   

This detail will be in Early Access Release #2 of my book on Friday.

-- Jack Krupansky

-Original Message- 
From: Otis Gospodnetic 
Sent: Monday, July 01, 2013 2:21 PM 
To: solr-user@lucene.apache.org 
Subject: Re: are fields stored or unstored by default xml 


Haven't tried it recently, but is that even legal?  Just be explicit :)

Otis
--
Solr & ElasticSearch Support -- http://sematext.com/
Performance Monitoring -- http://sematext.com/spm



On Mon, Jul 1, 2013 at 2:16 PM, Katie McCorkell
 wrote:

In schema.xml I know you can label a field as stored="false" or
stored="true", but if you say neither, which is it by default?

Thank you
Katie


Re: FileDataSource vs JdbcDataSouce (speed) Solr 3.5

2013-07-01 Thread Shawn Heisey

On 7/1/2013 12:56 PM, Mike L. wrote:

  Hey Ahmet / Solr User Group,

I tried using the built in UpdateCSV and it runs A LOT faster than a 
FileDataSource DIH as illustrated below. However, I am a bit confused about the 
numDocs/maxDoc values when doing an import this way. Here's my GET command 
against a tab-delimited file (I removed server info and additional fields; 
everything else is the same):

http://server:port/appname/solrcore/update/csv?commit=true&header=false&separator=%09&escape=\&stream.file=/location/of/file/on/server/file.csv&fieldnames=id,otherfields


My response from solr



0591


I am experimenting with 2 csv files (1 with 10 records, the other with 1000) to 
see if I can get this to run correctly before running my entire collection of 
data. I initially loaded the first 1000 records to an empty core and that 
seemed to work. However, when running the above with a csv file that has 10 
records, I would like to see only 10 active records in my core. What I get 
instead, when looking at my stats page:

numDocs 1000
maxDoc 1010

If I run the same url above while appending an 'optimize=true', I get:

numDocs 1000,
maxDoc 1000.


A discrepancy between numDocs and maxDoc indicates that there are 
deleted documents in your index.  You might already know this, so here's 
an answer to what I think might be your actual question:


If you want to delete the 1000 existing documents before adding the 10 
documents, then you have to actually do that deletion.  The CSV update 
handler works at a lower level than the DataImport handler, and doesn't 
have the "clean" option or a "full-import" command (which defaults to clean=true). 
The DIH is like a full application embedded inside Solr, one that uses 
an update handler -- it is not itself an update handler.  When 
clean=true or using full-import without a clean option, DIH itself sends 
a "delete all documents" update request.


If you didn't already know the bit about the deleted documents, then 
read this:


It can be normal for indexing "new" documents to cause deleted 
documents.  This happens when you have the same value in your UniqueKey 
field as documents that are already in your index.  Solr knows by the 
config you gave it that they are the same document, so it deletes the 
old one before adding the new one.  Solr has no way to know whether the 
document it already had or the document you are adding is more current, 
so it assumes you know what you are doing and takes care of the deletion 
for you.


When you optimize your index, deleted documents are purged, which is why 
the numbers match there.


Thanks,
Shawn



Perf. difference when the solr core is 'current' or not 'current'

2013-07-01 Thread jchen2000
in Solr's admin statistics page, there is a 'current' flag indicating whether
the core index reader is 'current' or not. According to some discussions in
this mailing list a few months back, it wouldn't affect anything. But my
observation is completely different. When the current flag was not checked
for some of the cores ( I have defined 15 cores in total), my median search
latency over 48M records was over 190ms, but if every current flag was
checked, the median dropped to only 87 ms. 

Another observation: restarting the Solr instance may not necessarily make the
'current' flags checked; I have to reload cores even after restarting Solr.

Could anybody explain the difference? I am using Datastax Enterprise 3.0.2

Thanks,



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Perf-difference-when-the-solr-core-is-current-or-not-current-tp4074438.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: "Classic" 4.2 master-slave replication not completing

2013-07-01 Thread Neal Ensor
is it conceivable that there's too much traffic, causing Solr to stall
re-opening the searcher (thus releasing to the new index)?  I'm grasping at
straws, and this is beginning to bug me a lot.  The traffic logs wouldn't
seem to support this (apart from periodic health-check pings, the load is
distributed fairly evenly across 3 slaves by a load-balancer tool).  After
35+ minutes this morning, none of the three successfully "unstuck", and had
to be manually core-reloaded.

Is there perhaps a configuration element I'm overlooking that might make
solr a bit less "friendly" about it, and just dump the searchers/reopen
when replication completes?

As a side note, I'm getting really frustrated with trying to get log4j
logging on 4.3.1 set up; my tomcat container persists in complaining that
it cannot find log4j.properties, when I've put it in the WEB-INF/classes of
the war file, have SLF4j libraries AND log4j at the shared container "lib"
level, and log4j.debug turned on.  I can't find any excuses why it cannot
seem to locate the configuration.

Any suggestions or pointers would be greatly appreciated.  Thanks!


On Thu, Jun 27, 2013 at 10:35 AM, Mark Miller  wrote:

> Odd - looks like it's stuck waiting to be notified that a new searcher is
> ready.
>
> - Mark
>
> On Jun 27, 2013, at 8:58 AM, Neal Ensor  wrote:
>
> > Okay, I have done this (updated to 4.3.1 across master and four slaves;
> one
> > of these is my own PC for experiments, it is not being accessed by
> clients).
> >
> > Just had a minor replication this morning, and all three slaves are
> "stuck"
> > again.  Replication supposedly started at 8:40, ended 30 seconds later or
> > so (on my local PC, set up identically to the other three slaves).  The
> > three slaves will NOT complete the roll-over to the new index.  All three
> > index folders have a write.lock and latest files are dated 8:40am (now it
> > is 8:54am, with no further activity in the index folders).  There exists
> an
> > "index.2013062708461" (or some variation thereof) in all three
> slaves'
> > data folder.
> >
> > The seemingly-relevant thread dump of a "snappuller" thread on each of
> > these slaves:
> >
> >   - sun.misc.Unsafe.park(Native Method)
> >   - java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
> >   -
> >
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811)
> >   -
> >
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:969)
> >   -
> >
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1281)
> >   - java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:218)
> >   - java.util.concurrent.FutureTask.get(FutureTask.java:83)
> >   -
> >
> org.apache.solr.handler.SnapPuller.openNewWriterAndSearcher(SnapPuller.java:631)
> >   -
> >
> org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:446)
> >   -
> >
> org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:317)
> >   - org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:223)
> >   -
> >   java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
> >   -
> >
> java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
> >   - java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
> >   -
> >
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
> >   -
> >
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)
> >   -
> >
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)
> >   -
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
> >   -
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
> >   - java.lang.Thread.run(Thread.java:662)
> >
> >
> > Here they sit.  My local PC "slave" replicated very quickly, switched
> over
> > to the new generation (206) immediately.  I am not sure why the three
> > slaves are dragging on this.  If there's any configuration elements or
> > other details you need, please let me know.  I can manually "kick" them
> by
> > reloading the core from the admin pages, but obviously I would like this
> to
> > be a hands-off process.  Any help is greatly appreciated; this has been
> > bugging me for some time now.
> >
> >
> >
> > On Mon, Jun 24, 2013 at 9:34 AM, Shalin Shekhar Mangar <
> > shalinman...@gmail.com> wrote:
> >
> >> A bunch of replication related issues were fixed in 4.2.1 so you're
> >> better off upgrading to 4.2.1 or later (4.3.1 is the latest release).
> >>
> >> On Mon, Jun 24, 2013 at 6:55 PM, Neal Ensor  wrote:
> >>> As a bit of background, we run a setup (coming from 3.6.1 to 4.2
> >> relatively
> >>> recently) with a single maste

Re: FileDataSource vs JdbcDataSouce (speed) Solr 3.5

2013-07-01 Thread Mike L.
 Hey Ahmet / Solr User Group,
 
   I tried using the built in UpdateCSV and it runs A LOT faster than a 
FileDataSource DIH as illustrated below. However, I am a bit confused about the 
numDocs/maxDoc values when doing an import this way. Here's my GET command 
against a tab-delimited file (I removed server info and additional fields; 
everything else is the same):

http://server:port/appname/solrcore/update/csv?commit=true&header=false&separator=%09&escape=\&stream.file=/location/of/file/on/server/file.csv&fieldnames=id,otherfields


My response from solr 



0591

 
I am experimenting with 2 csv files (1 with 10 records, the other with 1000) to 
see if I can get this to run correctly before running my entire collection of 
data. I initially loaded the first 1000 records to an empty core and that 
seemed to work. However, when running the above with a csv file that has 10 
records, I would like to see only 10 active records in my core. What I get 
instead, when looking at my stats page: 

numDocs 1000 
maxDoc 1010

If I run the same url above while appending an 'optimize=true', I get:

numDocs 1000, 
maxDoc 1000.

Perhaps the commit=true is not doing what it's supposed to, or am I missing 
something? I also tried passing a commit afterward like this:
http://server:port/appname/solrcore/update?stream.body=%3Ccommit/%3E (didn't 
seem to do anything either)
 

From: Ahmet Arslan 
To: "solr-user@lucene.apache.org" ; Mike L. 
 
Sent: Saturday, June 29, 2013 7:20 AM
Subject: Re: FileDataSource vs JdbcDataSouce (speed) Solr 3.5


Hi Mike,


You could try http://wiki.apache.org/solr/UpdateCSV 

And make sure you commit at the very end.





From: Mike L. 
To: "solr-user@lucene.apache.org"  
Sent: Saturday, June 29, 2013 3:15 AM
Subject: FileDataSource vs JdbcDataSouce (speed) Solr 3.5


 
I've been working on improving index time with a JdbcDataSource DIH based 
config and found it not to be as performant as I'd hoped for, for various 
reasons, not specifically due to solr. With that said, I decided to switch 
gears a bit and test out a FileDataSource setup... I assumed that by eliminating 
network latency, I should see drastic improvements in terms of import time..but 
I'm a bit surprised that this process seems to run much slower, at least the 
way I've initially coded it. (below)
 
The below is a barebones file import that I wrote which consumes a tab-delimited 
file. Nothing fancy here. The regex just separates out the fields... Is there a 
faster approach to doing this? If so, what is it?
 
Also, what is the "recommended" approach in terms of indexing/importing data? I 
know that may come across as a vague question, as there are various options 
available, but which one would be considered the "standard" approach within a 
production enterprise environment?
 
 
(below has been cleansed)
 

 
   
 
 
 
   

 
Thanks in advance,
Mike


Re: How to re-index Solr & get term frequency within documents

2013-07-01 Thread Otis Gospodnetic
If all your fields are stored, you can do it with
http://search-lucene.com/?q=solrentityprocessor
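A minimal DIH config for that would look roughly like this (a sketch; the entity 
name, source core URL and rows value are placeholders):

<dataConfig>
  <document>
    <entity name="reindex"
            processor="SolrEntityProcessor"
            url="http://localhost:8983/solr/sourcecore"
            query="*:*"
            rows="500"/>
  </document>
</dataConfig>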

Otherwise, just reindex the same way you indexed in the first place.
*Always* be ready to reindex from scratch.

Otis
--
Solr & ElasticSearch Support -- http://sematext.com/
Performance Monitoring -- http://sematext.com/spm



On Mon, Jul 1, 2013 at 1:29 PM, Tony Mullins  wrote:
> Thanks Jack , it worked.
>
> Could you please provide some info on how to re-index existing data in
> Solr, after changing the schema.xml ?
>
> Thanks,
> Tony
>
>
> On Mon, Jul 1, 2013 at 8:21 PM, Jack Krupansky wrote:
>
>> You can write any function query in the field list of the "fl" parameter.
>> Sounds like you want "termfreq":
>>
>> termfreq(field_arg,term)
>>
>> fl=id,a,b,c,termfreq(a,xyz)
>>
>>
>> -- Jack Krupansky
>>
>> -Original Message- From: Tony Mullins
>> Sent: Monday, July 01, 2013 10:47 AM
>> To: solr-user@lucene.apache.org
>> Subject: How to re-index Solr & get term frequency within documents
>>
>>
>> Hi,
>>
>> I am using Solr 4.3.0.
>> If I change my solr's schema.xml then do I need to re-index my solr ? And
>> if yes , how to ?
>>
>> My 2nd question is I need to find the frequency of term per document in all
>> documents of search result.
>>
>> My field is
>>
>> > multiValued="true" termVectors="true" termPositions="true"
>> termOffsets="true"/>
>>
>> And I am trying this query
>>
>> http://localhost:8080/solr/select/?q=iphone&fl=AuthorX%2CTitleX%2CCommentX&df=CommentX&wt=xml&indent=true&qt=tvrh&tv=true&tv.tf=true&tv.df=true&tv.positions&tv.offsets=true
>>
>> Its just returning me the result set, no info on my searched term's
>> (iphone) frequency in each document.
>>
>> How can I make Solr to return the frequency of searched term per document
>> in result set ?
>>
>> Thanks,
>> Tony.
>>


Re: are fields stored or unstored by default xml

2013-07-01 Thread Otis Gospodnetic
Haven't tried it recently, but is that even legal?  Just be explicit :)

Otis
--
Solr & ElasticSearch Support -- http://sematext.com/
Performance Monitoring -- http://sematext.com/spm



On Mon, Jul 1, 2013 at 2:16 PM, Katie McCorkell
 wrote:
> In schema.xml I know you can label a field as stored="false" or
> stored="true", but if you say neither, which is it by default?
>
> Thank you
> Katie


Re: dataconfig to index ZIP Files

2013-07-01 Thread ericrs22
I'm using the Tika plugin to do so and according to
http://tika.apache.org/0.5/formats.html it does


*ZIP archive (application/zip) Tika uses Java's built-in Zip classes to
parse ZIP files.
Support for ZIP was added in Tika 0.2.*



--
View this message in context: 
http://lucene.472066.n3.nabble.com/dataconfig-to-index-ZIP-Files-tp4073965p4074421.html
Sent from the Solr - User mailing list archive at Nabble.com.


are fields stored or unstored by default xml

2013-07-01 Thread Katie McCorkell
In schema.xml I know you can label a field as stored="false" or
stored="true", but if you say neither, which is it by default?

Thank you
Katie


Re: dataconfig to index ZIP Files

2013-07-01 Thread Noble Paul നോബിള്‍ नोब्ळ्
IIRC Zip files are not supported


On Mon, Jul 1, 2013 at 10:30 PM, ericrs22  wrote:

> To answer the previous Post:
>
> I was not sure what datasource="binaryFile" I took it from a PDF sample
> thinking that would help.
>
> After setting datasource="null" I'm still getting the same errors...
>
> 
>  password="SomePassword" />
> 
>processor="FileListEntityProcessor" baseDir="E:\ArchiveRoot"
> fileName=".zip$" recursive="true" rootEntity="false" dataSource="null"
> onError="skip">
>
> 
>  name="filename"/>
>
> 
>
> 
> 
>
> the logs report this:
>
>
> INFO  - 2013-07-01 16:45:57.317;
> org.apache.solr.handler.dataimport.DataImporter; Starting Full Import
> WARN  - 2013-07-01 16:45:57.333;
> org.apache.solr.handler.dataimport.SimplePropertiesWriter; Unable to read:
> dataimport.properties
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/dataconfig-to-index-ZIP-Files-tp4073965p4074399.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
-
Noble Paul


Re: How to re-index Solr & get term frequency within documents

2013-07-01 Thread Tony Mullins
Thanks Jack , it worked.

Could you please provide some info on how to re-index existing data in
Solr, after changing the schema.xml ?

Thanks,
Tony


On Mon, Jul 1, 2013 at 8:21 PM, Jack Krupansky wrote:

> You can write any function query in the field list of the "fl" parameter.
> Sounds like you want "termfreq":
>
> termfreq(field_arg,term)
>
> fl=id,a,b,c,termfreq(a,xyz)
>
>
> -- Jack Krupansky
>
> -Original Message- From: Tony Mullins
> Sent: Monday, July 01, 2013 10:47 AM
> To: solr-user@lucene.apache.org
> Subject: How to re-index Solr & get term frequency within documents
>
>
> Hi,
>
> I am using Solr 4.3.0.
> If I change my solr's schema.xml then do I need to re-index my solr ? And
> if yes , how to ?
>
> My 2nd question is I need to find the frequency of term per document in all
> documents of search result.
>
> My field is
>
>  multiValued="true" termVectors="true" termPositions="true"
> termOffsets="true"/>
>
> And I am trying this query
>
> http://localhost:8080/solr/select/?q=iphone&fl=AuthorX%2CTitleX%2CCommentX&df=CommentX&wt=xml&indent=true&qt=tvrh&tv=true&tv.tf=true&tv.df=true&tv.positions&tv.offsets=true
>
> Its just returning me the result set, no info on my searched term's
> (iphone) frequency in each document.
>
> How can I make Solr to return the frequency of searched term per document
> in result set ?
>
> Thanks,
> Tony.
>


Re: Does solr cloud required passwordless ssh?

2013-07-01 Thread Mark Miller
No, SolrCloud does not currently use ssh.

- Mark

On Jul 1, 2013, at 12:58 PM, adfel70  wrote:

> Hi
> Does solr cloud on a cluster of servers require passwordless ssh to be
> configured between the servers?
> 
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Does-solr-cloud-required-passwordless-ssh-tp4074398.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: cores sharing an instance

2013-07-01 Thread Roman Chyla
as for the second option:

If you look inside SolrResourceLoader, you will notice that before a
CoreContainer is created, a new class loader is also created

line:111

this.classLoader = createClassLoader(null, parent);

however, this parent object is always null, because it is called from:

public SolrResourceLoader( String instanceDir )
  {
this( instanceDir, null, null );
  }

but if you were able to replace the second null (parent class loader) with
a classloader of your own choice - ie. one that loads your singleton (but
only that singleton, you don't want to share other objects), your cores
should be able to see/share that object

So, as you can see, if you test it and it works, you may file a JIRA ticket
and help other folks out there (I was too lazy and worked around it in the
past - but that wasn't a good solution). If there is a well-justified reason
to share objects, it seems weird that the core is using 'null' as a parent class
loader.

HTH,

  roman






On Sun, Jun 30, 2013 at 2:18 PM, Peyman Faratin wrote:

> I see. If I wanted to try the second option ("find a place inside solr
> before the core is created") then where would that place be in the flow of
> app waking up? Currently what I am doing is each core loads its app caches
> via a requesthandler (in solrconfig.xml) that initializes the java class
> that does the loading. For instance:
>
>  startup="lazy" >
>
>  AppCaches
>
> 
>  class="com.name.Project.AppCaches"/>
>
>
> So each core has its own so specific cachedResources handler. Where in
> SOLR would I need to place the AppCaches code to make it visible to all
> other cores then?
>
> thank you Roman
>
> On Jun 29, 2013, at 10:58 AM, Roman Chyla  wrote:
>
> > Cores can be reloaded, they are inside solrcore loader /I forgot the
> exact
> > name/, and they will have different classloaders /that's servlet thing/,
> so
> > if you want singletons you must load them outside of the core, using a
> > parent classloader - in case of jetty, this means writing your own jetty
> > initialization or config to force shared class loaders. or find a place
> > inside the solr, before the core is created. Google for montysolr to see
> > the example of the first approach.
> >
> > But, unless you really have no other choice, using singletons is IMHO a
> bad
> > idea in this case
> >
> > Roman
> >
> > On 29 Jun 2013 10:18, "Peyman Faratin"  wrote:
> >>
> >> its the singleton pattern, where in my case i want an object (which is
> > RAM expensive) to be a centralized coordinator of application logic.
> >>
> >> thank you
> >>
> >> On Jun 29, 2013, at 1:16 AM, Shalin Shekhar Mangar <
> shalinman...@gmail.com>
> > wrote:
> >>
> >>> There is very little shared between multiple cores (instanceDir paths,
> >>> logging config maybe?). Why are you trying to do this?
> >>>
> >>> On Sat, Jun 29, 2013 at 1:14 AM, Peyman Faratin <
> pey...@robustlinks.com>
> > wrote:
>  Hi
> 
>  I have a multicore setup (in 4.3.0). Is it possible for one core to
> > share an instance of its class with other cores at run time? i.e.
> 
>  At run time core 1 makes an instance of object O_i
> 
>  core 1 --> object O_i
>  core 2
>  ---
>  core n
> 
>  then can core K access O_i? I know they can share properties but is it
> > possible to share objects?
> 
>  thank you
> 
> >>>
> >>>
> >>>
> >>> --
> >>> Regards,
> >>> Shalin Shekhar Mangar.
> >>
>
>


Re: dataconfig to index ZIP Files

2013-07-01 Thread ericrs22
To answer the previous Post:

I was not sure about datasource="binaryFile"; I took it from a PDF sample
thinking that would help.

After setting datasource="null" I'm still getting the same errors...














the logs report this:

 
INFO  - 2013-07-01 16:45:57.317;
org.apache.solr.handler.dataimport.DataImporter; Starting Full Import
WARN  - 2013-07-01 16:45:57.333;
org.apache.solr.handler.dataimport.SimplePropertiesWriter; Unable to read:
dataimport.properties




--
View this message in context: 
http://lucene.472066.n3.nabble.com/dataconfig-to-index-ZIP-Files-tp4073965p4074399.html
Sent from the Solr - User mailing list archive at Nabble.com.


Does solr cloud required passwordless ssh?

2013-07-01 Thread adfel70
Hi
Does solr cloud on a cluster of servers require passwordless ssh to be
configured between the servers?




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Does-solr-cloud-required-passwordless-ssh-tp4074398.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: documentCache not used in 4.3.1?

2013-07-01 Thread Daniel Collins
Regrettably, visibility is key for us :(  Documents must be searchable as
soon as they have been indexed (or as near as we can make it).  Our old
search system didn't do relevance sort, it was time-ordered (so it had a
much simpler job) but it did have sub-second latency, and that is what is
expected for its replacement (I know Solr doesn't like <1s currently, but
we live in hope!).  Tried explaining that by doing relevance sort we are
searching 100% of the collection, instead of the ~10%-20% a time-ordered
sort did (it effectively sharded by date and only searched as far back as
it needed to fill a page of results), but that tends to get blank looks
from business. :)

One of life's little challenges.
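
For reference, the kind of commit setup being discussed looks roughly like this in 
solrconfig.xml (a sketch - the intervals are illustrative, not our real values):

<autoCommit>
  <maxTime>60000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>1000</maxTime>
</autoSoftCommit>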


On 1 July 2013 11:10, Erick Erickson  wrote:

> Daniel:
>
> Soft commits invalidate the "top level" caches, which include
> things like filterCache, queryResultCache etc. Various
> "segment-level" caches are NOT invalidated, but you really
> don't have a lot of control from the Solr level over those
> anyway.
>
> But yeah, the tension between caching a bunch of stuff
> for query speedups and NRT is still with us. Soft commits
> are much less expensive than hard commits, but not being
> able to use the caches as much is the price. You're right
> that with such frequent autocommits, autowarming
> probably is not worth the effort.
>
> The question I always ask is whether 1 second is really
> necessary. Or, more accurately, worth the price. Often
> it's not and lengthening it out significantly may be an option,
> but that's a discussion for you to have with your product
> manager 
>
> I have seen configurations that have a more frequent hard
> commit (openSearcher=false) than soft commit. The
> mantra is "soft commits are about visibility, hard commits
> are about durability".
>
> FWIW,
> Erick
>
>
> On Mon, Jul 1, 2013 at 3:40 AM, Daniel Collins  >wrote:
>
> > We see similar results, again we softCommit every 1s (trying to get as
> NRT
> > as we can), and we very rarely get any hits in our caches.  As an
> > unscheduled test last week, we did shutdown indexing and noticed about
> 80%
> > hit rate in caches (and average query time dropped from ~1s to 100ms!)
> so I
> > think we are in the same position as you.
> >
> > I appreciate with such a frequent soft commit that the caches get
> > invalidated, but I was expecting cache warming to help though it doesn't
> > appear to be.  We *don't* currently run a warming query, my impression of
> > NRT was that it was better to not do that as otherwise you spend more
> time
> > warming the searcher and caches, and by the time you've done all that,
> the
> > searcher is invalidated anyway!
> >
> >
> > On 30 June 2013 01:58, Tim Vaillancourt  wrote:
> >
> > > That's a good idea, I'll try that next week.
> > >
> > > Thanks!
> > >
> > > Tim
> > >
> > >
> > > On 29/06/13 12:39 PM, Erick Erickson wrote:
> > >
> > >> Tim:
> > >>
> > >> Yeah, this doesn't make much sense to me either since,
> > >> as you say, you should be seeing some metrics upon
> > >> occasion. But do note that the underlying cache only gets
> > >> filled when getting documents to return in query results,
> > >> since there's no autowarming going on it may come and
> > >> go.
> > >>
> > >> But you can test this pretty quickly by lengthening your
> > >> autocommit interval or just not indexing anything
> > >> for a while, then run a bunch of queries and look at your
> > >> cache stats. That'll at least tell you whether it works at all.
> > >> You'll have to have hard commits turned off (or openSearcher
> > >> set to 'false') for that check too.
> > >>
> > >> Best
> > >> Erick
> > >>
> > >>
> > >> On Sat, Jun 29, 2013 at 2:48 PM, Vaillancourt, Tim<
> tvaillanco...@ea.com
> > >*
> > >> *wrote:
> > >>
> > >>  Yes, we are softCommit'ing every 1000ms, but that should be enough
> time
> > >>> to
> > >>> see metrics though, right? For example, I still get non-cumulative
> > >>> metrics
> > >>> from the other caches (which are also throw away). I've also
> > curl/sampled
> > >>> enough that I probably should have seen a value by now.
> > >>>
> > >>> If anyone else can reproduce this on 4.3.1 I will feel less crazy :).
> > >>>
> > >>> Cheers,
> > >>>
> > >>> Tim
> > >>>
> > >>> -Original Message-
> > >>> From: Erick Erickson [mailto:erickerickson@gmail.**com<
> > erickerick...@gmail.com>
> > >>> ]
> > >>> Sent: Saturday, June 29, 2013 10:13 AM
> > >>> To: solr-user@lucene.apache.org
> > >>> Subject: Re: documentCache not used in 4.3.1?
> > >>>
> > >>> It's especially weird that the hit ratio is so high and you're not
> > seeing
> > >>> anything in the cache. Are you perhaps soft committing frequently?
> Soft
> > >>> commits throw away all the top-level caches including documentCache I
> > >>> think
> > >>>
> > >>> Erick
> > >>>
> > >>>
> > >>> On Fri, Jun 28, 2013 at 7:23 PM, Tim Vaillancourt > **com
> > >>>
> >  wrote:
> >  Thanks Otis,
> > 
> >  Yeah I realized after sending my e-mail 

Concurrent Modification Exception

2013-07-01 Thread adityab
Hi, 
I have recently upgraded from Solr 3.5 to 4.2.1.
Also, we have added a spellcheck feature to our search query. During our
performance testing we have observed that for every 2000 requests, 1 request
fails.
The exception we observe in the solr log is ConcurrentModificationException.
Below is the complete stack trace for the exception.
Any idea what could potentially be the reason? I checked the JIRA list in
Solr/Lucene to see if there is any issue filed and fixed for this, but couldn't
find one that's directly associated with LRUCache.

thanks
Aditya 

2013-06-28 20:32:57,265 SEVERE [org.apache.solr.core.SolrCore] (http-80-20)
java.util.ConcurrentModificationException
at 
java.util.AbstractList$Itr.checkForComodification(AbstractList.java:372)
at java.util.AbstractList$Itr.next(AbstractList.java:343)
at java.util.AbstractList.equals(AbstractList.java:506)
at org.apache.solr.search.QueryResultKey.isEqual(QueryResultKey.java:96)
at org.apache.solr.search.QueryResultKey.equals(QueryResultKey.java:81)
at java.util.HashMap.getEntry(HashMap.java:349)
at java.util.LinkedHashMap.get(LinkedHashMap.java:280)
at org.apache.solr.search.LRUCache.get(LRUCache.java:130)
at
org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1276)
at
org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:457)
at
org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:410)
at
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:208)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1817)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:639)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:345)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:141)
at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at
org.jboss.web.tomcat.filters.ReplyHeaderFilter.doFilter(ReplyHeaderFilter.java:96)
at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:235)
at
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at
org.jboss.web.tomcat.security.SecurityAssociationValve.invoke(SecurityAssociationValve.java:190)
at
org.jboss.web.tomcat.security.JaccContextValve.invoke(JaccContextValve.java:92)
at
org.jboss.web.tomcat.security.SecurityContextEstablishmentValve.process(SecurityContextEstablishmentValve.java:126)
at
org.jboss.web.tomcat.security.SecurityContextEstablishmentValve.invoke(SecurityContextEstablishmentValve.java:70)
at
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at
org.jboss.web.tomcat.service.jca.CachedConnectionValve.invoke(CachedConnectionValve.java:158)
at
org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:567)
at
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:330)
at
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:829)
at
org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:598)
at 
org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
at java.lang.Thread.run(Thread.java:662)



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Concurrent-Modification-Exception-tp4074371.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: How to re-index Solr & get term frequency within documents

2013-07-01 Thread Jack Krupansky
You can write any function query in the field list of the "fl" parameter. 
Sounds like you want "termfreq":


termfreq(field_arg,term)

fl=id,a,b,c,termfreq(a,xyz)
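Applied to the query from the original message (the host, port and field names below simply mirror that example), the request would look something like:

http://localhost:8080/solr/select?q=iphone&df=CommentX&fl=AuthorX,TitleX,CommentX,termfreq(CommentX,'iphone')&wt=xml

and each returned document then carries the count of "iphone" in its CommentX field.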


-- Jack Krupansky

-Original Message- 
From: Tony Mullins

Sent: Monday, July 01, 2013 10:47 AM
To: solr-user@lucene.apache.org
Subject: How to re-index Solr & get term frequency within documents

Hi,

I am using Solr 4.3.0.
If I change my solr's schema.xml then do I need to re-index my solr ? And
if yes , how to ?

My 2nd question is I need to find the frequency of term per document in all
documents of search result.

My field is



And I am trying this query

http://localhost:8080/solr/select/?q=iphone&fl=AuthorX%2CTitleX%2CCommentX&df=CommentX&wt=xml&indent=true&qt=tvrh&tv=true&tv.tf=true&tv.df=true&tv.positions&tv.offsets=true

Its just returning me the result set, no info on my searched term's
(iphone) frequency in each document.

How can I make Solr to return the frequency of searched term per document
in result set ?

Thanks,
Tony. 



Re: ConcurrentUpdateSolrServer hanging

2013-07-01 Thread qungg
Hi,

blockUntilFinished blocks indefinitely sometimes. But if I send a commit from
another thread to the instance, the ConcurrentUpdateSolrServer unblocks and sends
the rest of the documents and commits. So the sequence looks like this:

1. adding documents as usual...
2. finish adding documents...
3. block until finished... blocks forever (I try to block before the commit;
call this commit 1)
4. from another thread, send a commit (let's call this commit 2)
5. magically unblocked... and flushed out the rest of the documents...
6. commit 1...  
7. commit 2 ... 

The order of commit in 6 and 7 is observed in solr log.
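For reference, a minimal sketch of the kind of indexing loop involved (the URL, queue size and thread count are only illustrative placeholders, not the real job):

import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class IndexingSketch {
    public static void main(String[] args) throws Exception {
        // queue size (1000) and thread count (4) are placeholder values
        ConcurrentUpdateSolrServer server =
                new ConcurrentUpdateSolrServer("http://localhost:8983/solr/collection1", 1000, 4);
        for (int i = 0; i < 100000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-" + i);
            server.add(doc);                 // 1. adding documents as usual
        }
        server.blockUntilFinished();         // 3. the call that sometimes never returns
        server.commit();                     // 6. commit 1
        server.shutdown();
    }
}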

Thanks,
Qun




--
View this message in context: 
http://lucene.472066.n3.nabble.com/ConcurrentUpdateSolrServer-hanging-tp4073620p4074366.html
Sent from the Solr - User mailing list archive at Nabble.com.


How to re-index Solr & get term frequency within documents

2013-07-01 Thread Tony Mullins
Hi,

I am using Solr 4.3.0.
If I change my Solr's schema.xml, do I need to re-index Solr? And
if yes, how?

My 2nd question is I need to find the frequency of term per document in all
documents of search result.

My field is

 

And I am trying this query

http://localhost:8080/solr/select/?q=iphone&fl=AuthorX%2CTitleX%2CCommentX&df=CommentX&wt=xml&indent=true&qt=tvrh&tv=true&tv.tf=true&tv.df=true&tv.positions&tv.offsets=true

It's just returning me the result set, with no info on my searched term's
(iphone) frequency in each document.

How can I make Solr to return the frequency of searched term per document
in result set ?

Thanks,
Tony.


Re: Distinct values in multivalued fields

2013-07-01 Thread Jack Krupansky
Unfortunately, update processors only "see" the new, fresh, incoming data, 
not any existing document data.


This is a case where your best bet may be to read the document first and 
then merge your new value into the existing list of values.
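A rough SolrJ sketch of that read-then-merge approach, assuming the id, title and tag_type field names from the example and that all fields are stored (everything here is illustrative, not a drop-in solution):

import java.util.Collection;
import java.util.LinkedHashSet;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class AddDistinctTag {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
        String id = "doc-1";      // placeholder id
        String newTag = "foo";

        // 1. read the existing document (requires stored fields)
        SolrDocument existing = solr.query(new SolrQuery("id:" + id)).getResults().get(0);

        // 2. merge the new value into a de-duplicated set of the existing values
        LinkedHashSet<Object> tags = new LinkedHashSet<Object>();
        Collection<Object> old = existing.getFieldValues("tag_type");
        if (old != null) {
            tags.addAll(old);
        }
        tags.add(newTag);

        // 3. re-index the document with the merged, distinct list
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", id);
        doc.addField("title", existing.getFieldValue("title"));
        for (Object t : tags) {
            doc.addField("tag_type", t);
        }
        solr.add(doc);
        solr.commit();
    }
}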



-- Jack Krupansky
-Original Message- 
From: tuedel

Sent: Monday, July 01, 2013 9:34 AM
To: solr-user@lucene.apache.org
Subject: Distinct values in multivalued fields

Hello everybody,

i have tried to make use of the UniqFieldsUpdateProcessorFactory in
order to achieve distinct values in multivalued fields. Example below:


  

  title
  tag_type

  
  



  
 uniq_fields
   
 

However, the data is indexed one by one. This may happen since a
document may get an additional tag in a future update. In order
to ensure there are no duplicate tags, I was hoping the
UpdateProcessorFactory would do what I want to achieve. In order to actually
add a tag, I am sending

"tag_type":{"add":"foo"}, which still adds the tag without checking whether
it is already part of the field. How can I achieve distinct values
on the Solr side?

To achieve this behavior, writing my own processor might
be a solution. However, I am uncertain how to do it and whether it's the proper way.
Imagine an incoming update - e.g. an update of an existing document having
several multivalued fields without specifying "add" or "set". Such an update
would cause the corresponding document to be dropped and re-indexed without
keeping any previously added values within the multivalued field.
Therefore, if a field is being updated and the value is not yet part of the
index, the value should be added; otherwise it should be ignored. The
processor needs to decide whether a value gets added to the index or
not, based on the existing index. Is that achievable on the Solr side?
Below is my current, pretty empty processor class:

public class ConditionalSolrUniqFieldValuesProcessorFactory extends
UpdateRequestProcessorFactory {

   @Override
   public UpdateRequestProcessor getInstance(SolrQueryRequest sqr,
SolrQueryResponse sqr1, UpdateRequestProcessor urp) {
   return new ConditionalUniqFieldValuesProcessor(urp);
   }

   class ConditionalUniqFieldValuesProcessor extends UpdateRequestProcessor
{

   public ConditionalUniqFieldValuesProcessor(UpdateRequestProcessor
next) {
   super(next);
   }

   @Override
   public void processAdd(AddUpdateCommand cmd) throws IOException {
   SolrInputDocument doc = cmd.getSolrInputDocument();

    Collection<String> incomingFieldNames = doc.getFieldNames();
   for (String t : incomingFieldNames) {
   /*
   is multivalued
   if (doc.getField(t).) {
   If multivalued and already part of index, drop from
index. Otherwise add to multivalued field.
   }
   */
   }

   }
   }
}







--
View this message in context: 
http://lucene.472066.n3.nabble.com/Distinct-values-in-multivalued-fields-tp4074337.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: Converting nested data model to solr schema

2013-07-01 Thread Jack Krupansky
Simply duplicate a subset of the fields that you want to query of the parent 
document on each child document and then you can directly query the child 
documents without any join.


Yes, given the complexity of your data, a two-step query process may be 
necessary for some queries - do one query to get parent or child IDs and 
then do a second query filtered by those IDs.


And, yes, this only approximates the full power of an SQL join - but at a 
tiny fraction of the cost.
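As a sketch of that two-step approach in SolrJ (all field names here - parent_doc_id, doc_id, content, attachment_author, attachment_date - are just assumptions matching the example data, not a fixed schema):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocument;

public class TwoStepQuery {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        // step 1: query the attachment documents and collect their parent ids
        SolrQuery children = new SolrQuery("attachment_author:Bob AND attachment_date:\"12-12-2012\"");
        children.setFields("parent_doc_id");
        StringBuilder ids = new StringBuilder();
        for (SolrDocument child : solr.query(children).getResults()) {
            if (ids.length() > 0) {
                ids.append(" OR ");
            }
            ids.append(child.getFieldValue("parent_doc_id"));
        }
        if (ids.length() == 0) {
            return; // no matching attachments, nothing to fetch
        }

        // step 2: query the parent documents, filtered by the collected ids
        SolrQuery parents = new SolrQuery("content:abc");
        parents.addFilterQuery("doc_id:(" + ids + ")");
        System.out.println(solr.query(parents).getResults().getNumFound());
    }
}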


-- Jack Krupansky

-Original Message- 
From: adfel70

Sent: Monday, July 01, 2013 9:56 AM
To: solr-user@lucene.apache.org
Subject: Converting nested data model to solr schema

Hi,
I have the following data model:
1. Document (fields: doc_id, author, content)
2. Each Document has multiple  attachment types. Each attachment type has
multiple instances. And each attachment type may have different fields.
for example:

  1
  john
  some long long text...
  
 
458
SomeText
12/12/2012
 
 
568
SomeText2
12/11/2012
 
  
  
 
345
SomeText
Jack
22-12-2012
 
 
897
SomeText2
Bob
23-12-2012
 
  


I want to index all this data in solr cloud.
My current solution is to index the original document by its self and index
each attachment as a single solr document with its parent_doc_id, and then
use solr join capability.
The problem with this solution is  that I must index all the attachments of
each document, and the document itself in the same shard (current solr
limitation).
This requires me to override the solr document distribution mechanism.
I fear that with this solution I may lose some of solr cloud's
capabilities.
My questions are:
1. Are my concerns regarding downside of overriding solr cloud's
out-of-the-box mechanism justified? Or should I proceed with this solution?
2. If I'm looking for another solution, can I  somehow keep all attachments
on the same document and be able to query on a single attachment?
A query example:
Retrieve  all documents where:
content: contains "abc"
AND
reply_attachment.author = 'Bob'
AND
reply_attachment.date = '12-12-2012'


Thanks.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Converting-nested-data-model-to-solr-schema-tp4074351.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: Distinct values in multivalued fields

2013-07-01 Thread Upayavira
Have a look at the DedupUpdateProcessorFactory, which may help you.
Although, I'm not sure if it works with multivalued fields.

Upayavira

On Mon, Jul 1, 2013, at 02:34 PM, tuedel wrote:
> Hello everybody,
> 
> i have tried to make use of the UniqFieldsUpdateProcessorFactory in 
> order to achieve distinct values in multivalued fields. Example below: 
> 
>  
> class="org.apache.solr.update.processor.UniqFieldsUpdateProcessorFactory"> 
>   
>title 
>tag_type 
>   
> 
> 
>  
> 
>  
> 
>   uniq_fields 
>  
>
> 
> However the data being is indexed one by one. This may happen, since a 
> document may will get an additional tag in a future update. Unfortunately
> in 
> order to ensure not having any duplicate tags, i was hoping, the 
> UpdateProcessorFactory is doing what i want to achieve. In order to
> actually 
> add a tag, i am sending an 
> 
> "tag_type" :{"add":"foo"}, which still adds the tag, without questioning
> if 
> its already part of the field. How may i be able to achieve distinct
> values 
> on solr side?! 
> 
> In order to achieve this behavior i suggest writing an own processor
> might
> be a solution. However i am uncertain how to do and if it's the proper
> way. 
> Imagine an incoming update - e.g. an update of an existing document
> having
> several multivalued fields without specifying "add" or "set". This task
> would cause the corresponding document to get dropped and re-indexed
> without
> keeping any previously added values within the multivalued field. 
> Therefore if a field is getting updated and not having the distinct value
> being part of the index yet, shall add the value, otherwise ignore it.
> The
> processor needs to define whether a field is getting added to the index
> or
> not in condition of the existing index. Is that achievable on Solr side?! 
> Below my current pretty empty processor class:
> 
> public class ConditionalSolrUniqFieldValuesProcessorFactory extends
> UpdateRequestProcessorFactory {
> 
> @Override
> public UpdateRequestProcessor getInstance(SolrQueryRequest sqr,
> SolrQueryResponse sqr1, UpdateRequestProcessor urp) {
> return new ConditionalUniqFieldValuesProcessor(urp);
> }
> 
> class ConditionalUniqFieldValuesProcessor extends
> UpdateRequestProcessor
> {
> 
> public ConditionalUniqFieldValuesProcessor(UpdateRequestProcessor
> next) {
> super(next);
> }
> 
> @Override
> public void processAdd(AddUpdateCommand cmd) throws IOException {
> SolrInputDocument doc = cmd.getSolrInputDocument();
> 
> Collection incomingFieldNames = doc.getFieldNames();
> for (String t : incomingFieldNames) {
> /*
> is multivalued
> if (doc.getField(t).) { 
> If multivalued and already part of index, drop from
> index. Otherwise add to multivalued field.
> }
> */
> }
>  
> }
> }
> }
> 
> 
> 
> 
> 
> 
> 
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Distinct-values-in-multivalued-fields-tp4074337.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Converting nested data model to solr schema

2013-07-01 Thread adfel70
Hi,
I have the following data model:
1. Document (fields: doc_id, author, content)
2. Each Document has multiple  attachment types. Each attachment type has
multiple instances. And each attachment type may have different fields.
for example:

   1
   john
   some long long text...
   
  
 458
 SomeText
 12/12/2012
  
  
 568
 SomeText2
 12/11/2012
  
   
   
  
 345
 SomeText
 Jack
 22-12-2012
  
  
 897
 SomeText2
 Bob
 23-12-2012
  
   


I want to index all this data in solr cloud.
My current solution is to index the original document by its self and index
each attachment as a single solr document with its parent_doc_id, and then
use solr join capability.
The problem with this solution is  that I must index all the attachments of
each document, and the document itself in the same shard (current solr
limitation).
This requires me to override the solr document distribution mechanism.
I fear that with this solution I may lose some of solr cloud's
capabilities.
My questions are:
1. Are my concerns regarding downside of overriding solr cloud's
out-of-the-box mechanism justified? Or should I proceed with this solution?
2. If I'm looking for another solution, can I  somehow keep all attachments
on the same document and be able to query on a single attachment?
A query example:
Retrieve  all documents where:
content: contains "abc"
AND
reply_attachment.author = 'Bob'
AND
reply_attachment.date = '12-12-2012'


Thanks.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Converting-nested-data-model-to-solr-schema-tp4074351.html
Sent from the Solr - User mailing list archive at Nabble.com.


Distinct values in multivalued fields

2013-07-01 Thread tuedel
Hello everybody,

i have tried to make use of the UniqFieldsUpdateProcessorFactory in 
order to achieve distinct values in multivalued fields. Example below: 

 

  
   title 
   tag_type 
  


 

 

  uniq_fields 
 
   

However, the data is indexed one by one. This may happen since a
document may get an additional tag in a future update. In order
to ensure there are no duplicate tags, I was hoping the
UpdateProcessorFactory would do what I want to achieve. In order to actually
add a tag, I am sending

"tag_type":{"add":"foo"}, which still adds the tag without checking whether
it is already part of the field. How can I achieve distinct values
on the Solr side?

To achieve this behavior, writing my own processor might
be a solution. However, I am uncertain how to do it and whether it's the proper way.
Imagine an incoming update - e.g. an update of an existing document having
several multivalued fields without specifying "add" or "set". Such an update
would cause the corresponding document to be dropped and re-indexed without
keeping any previously added values within the multivalued field.
Therefore, if a field is being updated and the value is not yet part of the
index, the value should be added; otherwise it should be ignored. The
processor needs to decide whether a value gets added to the index or
not, based on the existing index. Is that achievable on the Solr side?
Below is my current, pretty empty processor class:

public class ConditionalSolrUniqFieldValuesProcessorFactory extends
UpdateRequestProcessorFactory {

@Override
public UpdateRequestProcessor getInstance(SolrQueryRequest sqr,
SolrQueryResponse sqr1, UpdateRequestProcessor urp) {
return new ConditionalUniqFieldValuesProcessor(urp);
}

class ConditionalUniqFieldValuesProcessor extends UpdateRequestProcessor
{

public ConditionalUniqFieldValuesProcessor(UpdateRequestProcessor
next) {
super(next);
}

@Override
public void processAdd(AddUpdateCommand cmd) throws IOException {
SolrInputDocument doc = cmd.getSolrInputDocument();

Collection<String> incomingFieldNames = doc.getFieldNames();
for (String t : incomingFieldNames) {
/*
is multivalued
if (doc.getField(t).) { 
If multivalued and already part of index, drop from
index. Otherwise add to multivalued field.
}
*/
}
 
}
}
}







--
View this message in context: 
http://lucene.472066.n3.nabble.com/Distinct-values-in-multivalued-fields-tp4074337.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Shard tolerant partial results

2013-07-01 Thread Mark Miller

On Jul 1, 2013, at 6:56 AM, Phil Hoy  wrote:

> Perhaps an http header could be added or another attribute added to the solr 
> result node.

I thought that was already done - I'm surprised that it's not. If that's really 
the case, please make a JIRA issue.

- Mark

Re: Stemming query in Solr

2013-07-01 Thread snkar
I was just wondering if another solution might work. If we are able to extract 
the stem of the input search term(maybe using a C# based stemmer, some open 
source implementation of the Porter algorithm) for cases where the stemming 
option is selected, and submit the query to solr as a multiple character wild 
card query with respect to the stem, it should return me all the different 
variations of the stemmed word.

Example:

Search Term: burning
Stem: burn
Modified Query: burn*
Results: burn, burning, burns, burnt, etc.

I am sure this is not the proper way of executing a stemming by expansion, but 
this might just get the job done. What do you think? Trying to think of test 
case where this will fail.
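A small Java sketch of that idea, using the Snowball (Porter) stemmer that ships with Lucene's analyzers module - the field name and term are placeholders, and a C# Porter implementation would play the same role:

import org.tartarus.snowball.ext.EnglishStemmer;

public class StemWildcardQuery {
    public static void main(String[] args) {
        String term = "burning";
        EnglishStemmer stemmer = new EnglishStemmer();
        stemmer.setCurrent(term.toLowerCase());
        stemmer.stem();
        String stem = stemmer.getCurrent();              // typically "burn"
        String query = "ContentSearch:" + stem + "*";    // multi-character wildcard on the stem
        System.out.println(query);                       // e.g. ContentSearch:burn*
    }
}

One case where this breaks down is a stemmer that rewrites rather than truncates: "happy" stems to "happi", and "happi*" no longer matches the surface form "happy".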

 On Mon, 01 Jul 2013 03:42:34 -0700 Erick Erickson [via 
Lucene] wrote  


 bq:  But looks like it is executing the search for an exact text based 
match with the stem "burn". 

Right. You need to appreciate index time as opposed to query time stemming. 
Your field 
definition has both turned on. The admin/analysis page will help here 
.. 

At index time, the terms are stemmed, and _only_ the reduced term is put in 
the index. 
At query time, the same thing happens and _only_ the reduced term is 
searched for. 

By stemming at index time, you lose the original form of the word, it's 
just gone and 
nothing about checking/unchecking the "stem" bits will recover it. So the 
general 
solution is to index the field twice, once with stemming and once without 
in order 
to have the ability to do both stemmed and exact matches. I think I saw a 
clever 
approach to doing this involving a custom filter but can't find it now. As 
I recall it 
indexed the un-stemmed version like a synonym with some kind of marker 
to indicate exact match when necessary 

Best 
Erick 


On Mon, Jul 1, 2013 at 5:15 AM, snkar <[hidden email]> wrote: 

> Hi Erick, 
> 
> Thanks for the reply. 
> 
> Here is what the situation is: 
> 
> Relevant portion of Solr Schema: 
> <field name="Content" type="text_general" indexed="false" 
stored="true" 
> required="true"/> 
> <field name="ContentSearch" type="text_general" indexed="true" 
> stored="false" multiValued="true"/> 
> <field name="ContentSearchStemming" type="text_stem" indexed="true" 
> stored="false" multiValued="true"/> 
> <copyField source="Content" dest="ContentSearch"/> 
> <copyField source="Content" dest="ContentSearchStemming"/> 
> 
> <fieldType name="text_general" class="solr.TextField" 
> positionIncrementGap="100"> <analyzer type="index"> 
<tokenizer 
> class="solr.StandardTokenizerFactory"/> <filter 
> class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" 
> enablePositionIncrements="true" /> <filter 
> class="solr.LowerCaseFilterFactory"/> </analyzer> 
<analyzer 
> type="query"> <tokenizer 
class="solr.StandardTokenizerFactory"/> 
> <filter class="solr.StopFilterFactory" ignoreCase="true" 
> words="stopwords.txt" enablePositionIncrements="true" /> 
<filter 
> class="solr.LowerCaseFilterFactory"/> </analyzer> 
> </fieldType> 
> 
> <fieldType name="text_stem" class="solr.TextField" > 
> <analyzer> <tokenizer 
class="solr.WhitespaceTokenizerFactory"/> 
> <filter class="solr.SnowballPorterFilterFactory"/> 
</analyzer> 
> </fieldType> 
> When I am indexing a document, the content gets stored as is in the 
> Content field and gets copied over to ContentSearch and 
> ContentSearchStemming for text based search and stemming search 
> respectively. So, the ContentSearchStemming field does store the 
> stem/reduced form of the terms. I have checked this with the Luke as well 
> as the Admin Schema Browser --> Term Info. In the Admin 
> Analysis screen, I have tested and found that if I index the text 
> "burning", it gets reduced to and stored as "burn". So far so good. 
> 
> Now in the UI, 
> lets say the user puts in the term "burn" and checks the stemming option. 
> The expectation is that since the user has specified stemming, the results 
> should be returned for the term "burn" as well as for all terms which has 
> their stem as "burn" i.e. burning, burned, burns, etc. 
> lets say the user puts in the term "burning" and checks the stemming 
> option. The expectation is that since the user has specified stemming, the 
> results should be returned for the term "burning" as well as for all terms 
> which has their stem as "burn" i.e. burn, burned, burns, etc. 
> The query that gets submitted to Solr: q=ContentSearchStemming:burning 
> From Debug Info: 
> <str 
name="rawquerystring">ContentSearchStemming:burning</str> 
> <str 
name="querystring">ContentSearchStemming:burning</str> 
> <str 
name="parsedquery">ContentSearchStemming:burn</str> 
> <str 
> 
name="parsedquery_toString">ContentSearchStemming:burn</str>
 
> So, whe

Re: RemoveDuplicatesTokenFilterFactory to avoid import duplicate values in multivalued field

2013-07-01 Thread Jack Krupansky
Your stated problem seems to have nothing to do with the message subject 
line relating to RemoveDuplicatesTokenFilterFactory. Please start a new 
message thread unless you really are concerned with an issue related to 
RemoveDuplicatesTokenFilterFactory.


This kind of "thread hijacking" is inappropriate for this email list (or any 
email list.)


-- Jack Krupansky

-Original Message- 
From: tuedel

Sent: Monday, July 01, 2013 8:15 AM
To: solr-user@lucene.apache.org
Subject: Re: RemoveDuplicatesTokenFilterFactory to avoid import duplicate 
values in multivalued field


Hey, i have tried to make use of the UniqFieldsUpdateProcessorFactory in
order to achieve distinct values in multivalued fields. Example below:


  

  title
  tag_type

  
  



  
 uniq_fields
   
 

However, the data is indexed one by one. This may happen since a
document may get an additional tag in a future update. In order
to ensure there are no duplicate tags, I was hoping the
UpdateProcessorFactory would do what I want to achieve. In order to actually
add a tag, I am sending

"tag_type":{"add":"foo"}, which still adds the tag without checking whether
it is already part of the field. How can I achieve distinct values
on the Solr side?




--
View this message in context: 
http://lucene.472066.n3.nabble.com/RemoveDuplicatesTokenFilterFactory-to-avoid-import-duplicate-values-in-multivalued-field-tp4029004p4074324.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: Stemming query in Solr

2013-07-01 Thread snkar

So the general solution is to index the field twice, once with stemming and 
once without in order to have the ability to do both stemmed and exact matches 

I am already indexing the text twice using the ContentSearch and 
ContentSearchStemming fields. But what this allows me is to return "burning" as 
well as "burn" if the user specifies "burning" as the input search term, 
"burning" being the exact match:

ContentSearch:burning + ContentSearchStemming:burn(reduced from 
ContentSearchStemming:burning)

What I cannot figure out is how this is going to help me instruct Solr to 
execute the query for the different grammatical variations of the input search 
term's stem, i.e. a stemming query for "burning" expands to a text based query for 
"burn", "burns", "burned", "burning", etc.

You mentioned something about synonym. This was also mentioned in the Solr Wiki:
A related technology to stemming is lemmatization, which allows for "stemming" 
by expansion, taking a root word and 'expanding' it to all of its various 
forms. Lemmatization can be used either at insertion time or at query time. 
Lucene/Solr does not have built-in support for lemmatization but it can be 
simulated by using your own dictionaries and the SynonymFilterFactory  

I think what I need is exactly this point:

Lucene/Solr does not have built-in support for lemmatization but it can be 
simulated by using your own dictionaries and the SynonymFilterFactory

But I am not sure how to go about it, and exactly how synonyms can help me here, 
as I am not looking for synonyms but rather different expansions of the stemmed 
word.

 On Mon, 01 Jul 2013 03:42:34 -0700 Erick Erickson [via Lucene] 
 wrote  


 bq:  But looks like it is executing the search for an exact text based 
match with the stem "burn". 

Right. You need to appreciate index time as opposed to query time stemming. 
Your field 
definition has both turned on. The admin/analysis page will help here 
.. 

At index time, the terms are stemmed, and _only_ the reduced term is put in 
the index. 
At query time, the same thing happens and _only_ the reduced term is 
searched for. 

By stemming at index time, you lose the original form of the word, it's 
just gone and 
nothing about checking/unchecking the "stem" bits will recover it. So the 
general 
solution is to index the field twice, once with stemming and once without 
in order 
to have the ability to do both stemmed and exact matches. I think I saw a 
clever 
approach to doing this involving a custom filter but can't find it now. As 
I recall it 
indexed the un-stemmed version like a synonym with some kind of marker 
to indicate exact match when necessary 

Best 
Erick 


On Mon, Jul 1, 2013 at 5:15 AM, snkar <[hidden email]> wrote: 

> Hi Erick, 
> 
> Thanks for the reply. 
> 
> Here is what the situation is: 
> 
> Relevant portion of Solr Schema: 
> <field name="Content" type="text_general" indexed="false" 
stored="true" 
> required="true"/> 
> <field name="ContentSearch" type="text_general" indexed="true" 
> stored="false" multiValued="true"/> 
> <field name="ContentSearchStemming" type="text_stem" indexed="true" 
> stored="false" multiValued="true"/> 
> <copyField source="Content" dest="ContentSearch"/> 
> <copyField source="Content" dest="ContentSearchStemming"/> 
> 
> <fieldType name="text_general" class="solr.TextField" 
> positionIncrementGap="100"> <analyzer type="index"> 
<tokenizer 
> class="solr.StandardTokenizerFactory"/> <filter 
> class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" 
> enablePositionIncrements="true" /> <filter 
> class="solr.LowerCaseFilterFactory"/> </analyzer> 
<analyzer 
> type="query"> <tokenizer 
class="solr.StandardTokenizerFactory"/> 
> <filter class="solr.StopFilterFactory" ignoreCase="true" 
> words="stopwords.txt" enablePositionIncrements="true" /> 
<filter 
> class="solr.LowerCaseFilterFactory"/> </analyzer> 
> </fieldType> 
> 
> <fieldType name="text_stem" class="solr.TextField" > 
> <analyzer> <tokenizer 
class="solr.WhitespaceTokenizerFactory"/> 
> <filter class="solr.SnowballPorterFilterFactory"/> 
</analyzer> 
> </fieldType> 
> When I am indexing a document, the content gets stored as is in the 
> Content field and gets copied over to ContentSearch and 
> ContentSearchStemming for text based search and stemming search 
> respectively. So, the ContentSearchStemming field does store the 
> stem/reduced form of the terms. I have checked this with the Luke as well 
> as the Admin Schema Browser --> Term Info. In the Admin 
> Analysis screen, I have tested and found that if I index the text 
> "burning", it gets reduced to and stored as "burn". So far so good. 
> 
> Now in the UI, 
> lets say the user puts in the term "burn" and checks the stemming option. 
> The expectation is that s

Re: Unique key error while indexing pdf files

2013-07-01 Thread Jack Krupansky
It's really 100% up to you how you want to come up with the unique key 
values for your documents. What would you like them to be? Just use that. 
Anything (within reason) - anything goes.


But it also comes back to your data model. You absolutely must come up with 
a data model for how you expect to index and query data in Solr before you 
just start throwing random data into Solr.


1. Design your data model.
2. Produce a Solr schema from that data model.
3. Map the raw data from your data sources (e.g., PDF files) to the fields 
in your Solr schema.


That last step includes the ID/key field, but your data model will imply any 
requirements for what the ID/key should be.


To be absolutely clear, it is 100% up to you to design the ID/key for every 
document; Solr does NOT do that for you.


Even if you are just "exploring", at least come up with an "exploratory" 
data model - which includes what expectations you have about the unique 
ID/key for each document.


So, for that first PDF file, what expectation (according to your data model) 
do you have for what its ID/key should be?


-- Jack Krupansky

-Original Message- 
From: archit2112

Sent: Monday, July 01, 2013 8:30 AM
To: solr-user@lucene.apache.org
Subject: Re: Unique key error while indexing pdf files

I'm new to Solr. I'm just trying to understand and explore various features
offered by Solr and their implementations. I would be very grateful if you
could solve my problem with any example of your choice. I just want to learn
how I can index PDF documents using the data import handler.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Unique-key-error-while-indexing-pdf-files-tp4074314p4074327.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: RemoveDuplicatesTokenFilterFactory to avoid import duplicate values in multivalued field

2013-07-01 Thread tuedel
Hey, i have tried to make use of the UniqFieldsUpdateProcessorFactory in
order to achieve distinct values in multivalued fields. Example below:


   
 
   title
   tag_type
 
   
   
 

 
   
  uniq_fields

  

However, the data is indexed one by one. This may happen since a
document may get an additional tag in a future update. In order
to ensure there are no duplicate tags, I was hoping the
UpdateProcessorFactory would do what I want to achieve. In order to actually
add a tag, I am sending

"tag_type":{"add":"foo"}, which still adds the tag without checking whether
it is already part of the field. How can I achieve distinct values
on the Solr side?




--
View this message in context: 
http://lucene.472066.n3.nabble.com/RemoveDuplicatesTokenFilterFactory-to-avoid-import-duplicate-values-in-multivalued-field-tp4029004p4074324.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Unique key error while indexing pdf files

2013-07-01 Thread archit2112
I'm new to Solr. I'm just trying to understand and explore various features
offered by Solr and their implementations. I would be very grateful if you
could solve my problem with any example of your choice. I just want to learn
how I can index PDF documents using the data import handler.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Unique-key-error-while-indexing-pdf-files-tp4074314p4074327.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Unique key error while indexing pdf files

2013-07-01 Thread Jack Krupansky

It all depends on your data model - tell us more about your data model.

For example, how will users or applications query these documents and what 
will they expect to be able to do with the ID/key for the documents?


How are you expecting to identify documents in your data model?

-- Jack Krupansky

-Original Message- 
From: archit2112

Sent: Monday, July 01, 2013 7:17 AM
To: solr-user@lucene.apache.org
Subject: Unique key error while indexing pdf files

Hi

Im trying to index pdf files in solr 4.3.0 using the data import handler.

*My request handler - *


   
 data-config1.xml
   
 

*My data-config1.xml *















Now when I try to index the files I get the following error -

org.apache.solr.common.SolrException: Document is missing mandatory
uniqueKey field: id
at
org.apache.solr.update.AddUpdateCommand.getIndexedId(AddUpdateCommand.java:88)
at
org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:517)
at
org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:396)
at
org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
at org.apache.solr.handler.dataimport.SolrWriter.upload(SolrWriter.java:70)
at
org.apache.solr.handler.dataimport.DataImportHandler$1.upload(DataImportHandler.java:235)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:500)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:491)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:404)
at
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:319)
at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:227)
at
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:422)
at
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:487)
at
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:468)


This problem can be solved easily in the case of database indexing, but I don't
know how to go about the unique key of a document. How do I define the id
field (unique key) of a PDF file? How do I solve this problem?

Thanks in advance




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Unique-key-error-while-indexing-pdf-files-tp4074314.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Unique key error while indexing pdf files

2013-07-01 Thread archit2112
Hi

Im trying to index pdf files in solr 4.3.0 using the data import handler. 

*My request handler - *

 
 
  data-config1.xml 
 
   

*My data-config1.xml *

 
 
 
 
 



 
 
 
 


Now when I try to index the files I get the following error -

org.apache.solr.common.SolrException: Document is missing mandatory
uniqueKey field: id
at
org.apache.solr.update.AddUpdateCommand.getIndexedId(AddUpdateCommand.java:88)
at
org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:517)
at
org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:396)
at
org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
at 
org.apache.solr.handler.dataimport.SolrWriter.upload(SolrWriter.java:70)
at
org.apache.solr.handler.dataimport.DataImportHandler$1.upload(DataImportHandler.java:235)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:500)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:491)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:404)
at
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:319)
at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:227)
at
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:422)
at
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:487)
at
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:468)


This problem can be solved easily in the case of database indexing, but I don't
know how to go about the unique key of a document. How do I define the id
field (unique key) of a PDF file? How do I solve this problem?

Thanks in advance




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Unique-key-error-while-indexing-pdf-files-tp4074314.html
Sent from the Solr - User mailing list archive at Nabble.com.


Shard tolerant partial results

2013-07-01 Thread Phil Hoy
Hi,

When doing distributed searches with shards.tolerant set while the hosts for a
slice are down, and the response is therefore partial, how is that best inferred?
We would like to not cache the results upstream and perhaps inform the end user
in some way.

I am aware that shards.info could be used; however, I am concerned this may have
performance implications due to the cost of parsing the response from Solr, and
perhaps some extra cost incurred by Solr to generate the response.

Perhaps an http header could be added or another attribute added to the solr 
result node.

Phil


Re: Multiple groups of boolean queries in a single query.

2013-07-01 Thread Erick Erickson
Have you tried the query you indicated? Because it should
"just work" barring syntax errors. The only other thing you
might want is to turn on grouping by field type. That'll
return separate sections by type, say the top 3 (default 1)
documents in each type. If you don't group, you have the
possibility that your entire results (i.e. the number of docs
in the &rows parameter) will be all one type.

see:
http://wiki.apache.org/solr/FieldCollapsing
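For instance, something along these lines appended to the request (the limit value is only an example) keeps each category represented in the response:

&group=true&group.field=xyz_category&group.limit=3

which returns up to 3 documents per xyz_category value instead of one flat, possibly single-category, list.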

Best
Erick


On Mon, Jul 1, 2013 at 6:06 AM, samabhiK  wrote:

> My entire concern is to be able to make a single query to fetch all the
> types
> of records. If I had to create three different cores for this different
> types of data, I would have to make 3 calls to solr to fetch the entire set
> of data. And I will be having approx 15 such types in real.
>
> Also, at any given record, either the section 1 fields are filled up or
> section 2's or section 3's. At no point, will we have all these fields
> populated in a single record. Only field that will have data for all
> records
> is xyz_category to allow us to partition the data set.
>
> Any suggestions in writing a single query to fetch all the data we need
> will
> be highly appreciated.
>
> Thanks.
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Multiple-groups-of-boolean-queries-in-a-single-query-tp4074294p4074296.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Stemming query in Solr

2013-07-01 Thread Erick Erickson
bq:  But looks like it is executing the search for an exact text based
match with the stem "burn".

Right. You need to appreciate index time as opposed to query time stemming.
Your field
definition has both turned on. The admin/analysis page will help here ..

At index time, the terms are stemmed, and _only_ the reduced term is put in
the index.
At query time, the same thing happens and _only_ the reduced term is
searched for.

By stemming at index time, you lose the original form of the word, it's
just gone and
nothing about checking/unchecking the "stem" bits will recover it. So the
general
solution is to index the field twice, once with stemming and once without
in order
to have the ability to do both stemmed and exact matches. I think I saw a
clever
approach to doing this involving a custom filter but can't find it now. As
I recall it
indexed the un-stemmed version like a synonym with some kind of marker
to indicate exact match when necessary
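One way to get both forms into a single field - quite possibly the trick being recalled here, though that is a guess - is to repeat each token as a keyword before stemming; treat this as a sketch, not the definitive recipe:

<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.KeywordRepeatFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>

KeywordRepeatFilterFactory emits every token twice, marking one copy as a keyword so the stemmer leaves it untouched; RemoveDuplicatesTokenFilterFactory then drops the second copy whenever stemming did not change anything.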

Best
Erick


On Mon, Jul 1, 2013 at 5:15 AM, snkar  wrote:

> Hi Erick,
>
> Thanks for the reply.
>
> Here is what the situation is:
>
> Relevant portion of Solr Schema:
> <field name="Content" type="text_general" indexed="false" stored="true" required="true"/>
> <field name="ContentSearch" type="text_general" indexed="true" stored="false" multiValued="true"/>
> <field name="ContentSearchStemming" type="text_stem" indexed="true" stored="false" multiValued="true"/>
> <copyField source="Content" dest="ContentSearch"/>
> <copyField source="Content" dest="ContentSearchStemming"/>
>
> <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
>
> <fieldType name="text_stem" class="solr.TextField">
>   <analyzer>
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.SnowballPorterFilterFactory"/>
>   </analyzer>
> </fieldType>
> 
> When I am indexing a document, the content gets stored as is in the
> Content field and gets copied over to ContentSearch and
> ContentSearchStemming for text based search and stemming search
> respectively. So, the ContentSearchStemming field does store the
> stem/reduced form of the terms. I have checked this with the Luke as well
> as the Admin Schema Browser --> Term Info. In the Admin
> Analysis screen, I have tested and found that if I index the text
> "burning", it gets reduced to and stored as "burn". So far so good.
>
> Now in the UI,
> lets say the user puts in the term "burn" and checks the stemming option.
> The expectation is that since the user has specified stemming, the results
> should be returned for the term "burn" as well as for all terms which has
> their stem as "burn" i.e. burning, burned, burns, etc.
> lets say the user puts in the term "burning" and checks the stemming
> option. The expectation is that since the user has specified stemming, the
> results should be returned for the term "burning" as well as for all terms
> which has their stem as "burn" i.e. burn, burned, burns, etc.
> The query that gets submitted to Solr: q=ContentSearchStemming:burning
> From Debug Info:
> <str name="rawquerystring">ContentSearchStemming:burning</str>
> <str name="querystring">ContentSearchStemming:burning</str>
> <str name="parsedquery">ContentSearchStemming:burn</str>
> <str name="parsedquery_toString">ContentSearchStemming:burn</str>
> So, when the results are returned, I am only getting the hits highlighted
> with the term "burn", though the same document contains terms like burning
> and
> burns.
>
> I thought that the stemming should work like this:
> The stemming filter in the queryanalyzer chain would reduce the input word
> to its stem. burning --> burn
> The query component should scan through the terms and match those terms
> for which it finds a match between the stem of the term with the stem of
> the input term. burns --> burn (matches) burning --> burn
> The first point is happening. But looks like it is executing the search
> for an exact text based match with the stem "burn". Hence, burns or burned
> are not getting returned.
> Hope I was able to make myself clear.
>
>  On Fri, 28 Jun 2013 05:59:37 -0700 Erick Erickson [via Lucene] &
> lt;ml-node+s472066n4073901...@n3.nabble.com> wrote 
>
>
>  First, this is for the Java version, I hope it extends to C#.
>
> But in your configuration, when you're indexing the stemmer
> should be storing the reduced form in the index. Then, when
> searching, the search should be against the reduced term.
> To c

Re: Index pdf files.

2013-07-01 Thread archit2112
I figured it out. It was a problem with the regular expression I used in
data-config.xml.




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Index-pdf-files-tp4074278p4074304.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Index pdf files.

2013-07-01 Thread Erick Erickson
OK, have you done anything custom? You get
this where? solr logs? Echoed back in the browser?
In response to what command?

You haven't provided enough info to help us help you.
You might review:
http://wiki.apache.org/solr/UsingMailingLists

Best
Erick


On Mon, Jul 1, 2013 at 6:08 AM, archit2112  wrote:

> Hi
>
> Thanks a lot. I did what you said. Now I'm getting the following error.
>
> Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException:
> java.util.regex.PatternSyntaxException: Dangling meta character '*' near
> index 0
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Index-pdf-files-tp4074278p4074297.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: documentCache not used in 4.3.1?

2013-07-01 Thread Erick Erickson
Daniel:

Soft commits invalidate the "top level" caches, which include
things like filterCache, queryResultCache etc. Various
"segment-level" caches are NOT invalidated, but you really
don't have a lot of control from the Solr level over those
anyway.

But yeah, the tension between caching a bunch of stuff
for query speedups and NRT is still with us. Soft commits
are much less expensive than hard commits, but not being
able to use the caches as much is the price. You're right
that with such frequent autocommits, autowarming
probably is not worth the effort.

The question I always ask is whether 1 second is really
necessary. Or, more accurately, worth the price. Often
it's not and lengthening it out significantly may be an option,
but that's a discussion for you to have with your product
manager 

I have seen configurations that have a more frequent hard
commit (openSearcher=false) than soft commit. The
mantra is "soft commits are about visibility, hard commits
are about durability".
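For reference, that pattern looks roughly like this in solrconfig.xml (the times are illustrative, not a recommendation):

<autoCommit>
  <maxTime>60000</maxTime>           <!-- hard commit: durability -->
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>1000</maxTime>            <!-- soft commit: visibility -->
</autoSoftCommit>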

FWIW,
Erick


On Mon, Jul 1, 2013 at 3:40 AM, Daniel Collins wrote:

> We see similar results, again we softCommit every 1s (trying to get as NRT
> as we can), and we very rarely get any hits in our caches.  As an
> unscheduled test last week, we did shutdown indexing and noticed about 80%
> hit rate in caches (and average query time dropped from ~1s to 100ms!) so I
> think we are in the same position as you.
>
> I appreciate with such a frequent soft commit that the caches get
> invalidated, but I was expecting cache warming to help though it doesn't
> appear to be.  We *don't* currently run a warming query, my impression of
> NRT was that it was better to not do that as otherwise you spend more time
> warming the searcher and caches, and by the time you've done all that, the
> searcher is invalidated anyway!
>
>
> On 30 June 2013 01:58, Tim Vaillancourt  wrote:
>
> > That's a good idea, I'll try that next week.
> >
> > Thanks!
> >
> > Tim
> >
> >
> > On 29/06/13 12:39 PM, Erick Erickson wrote:
> >
> >> Tim:
> >>
> >> Yeah, this doesn't make much sense to me either since,
> >> as you say, you should be seeing some metrics upon
> >> occasion. But do note that the underlying cache only gets
> >> filled when getting documents to return in query results,
> >> since there's no autowarming going on it may come and
> >> go.
> >>
> >> But you can test this pretty quickly by lengthening your
> >> autocommit interval or just not indexing anything
> >> for a while, then run a bunch of queries and look at your
> >> cache stats. That'll at least tell you whether it works at all.
> >> You'll have to have hard commits turned off (or openSearcher
> >> set to 'false') for that check too.
> >>
> >> Best
> >> Erick
> >>
> >>
> >> On Sat, Jun 29, 2013 at 2:48 PM, Vaillancourt, Tim >*
> >> *wrote:
> >>
> >>  Yes, we are softCommit'ing every 1000ms, but that should be enough time
> >>> to
> >>> see metrics though, right? For example, I still get non-cumulative
> >>> metrics
> >>> from the other caches (which are also throw away). I've also
> curl/sampled
> >>> enough that I probably should have seen a value by now.
> >>>
> >>> If anyone else can reproduce this on 4.3.1 I will feel less crazy :).
> >>>
> >>> Cheers,
> >>>
> >>> Tim
> >>>
> >>> -Original Message-
> >>> From: Erick Erickson [mailto:erickerickson@gmail.**com<
> erickerick...@gmail.com>
> >>> ]
> >>> Sent: Saturday, June 29, 2013 10:13 AM
> >>> To: solr-user@lucene.apache.org
> >>> Subject: Re: documentCache not used in 4.3.1?
> >>>
> >>> It's especially weird that the hit ratio is so high and you're not
> seeing
> >>> anything in the cache. Are you perhaps soft committing frequently? Soft
> >>> commits throw away all the top-level caches including documentCache I
> >>> think
> >>>
> >>> Erick
> >>>
> >>>
> >>> On Fri, Jun 28, 2013 at 7:23 PM, Tim Vaillancourt **com
> >>>
>  wrote:
>  Thanks Otis,
> 
>  Yeah I realized after sending my e-mail that doc cache does not warm,
>  however I'm still lost on why there are no other metrics.
> 
>  Thanks!
> 
>  Tim
> 
> 
>  On 28 June 2013 16:22, Otis Gospodnetic otis.gospodne...@gmail.com>
>  >
>  wrote:
> 
>   Hi Tim,
> >
> > Not sure about the zeros in 4.3.1, but in SPM we see all these
> > numbers are non-0, though I haven't had the chance to confirm with
> >
>  Solr 4.3.1.
> >>>
>  Note that you can't really autowarm document cache...
> >
> > Otis
> > --
> > Solr&  ElasticSearch Support -- http://sematext.com/ Performance
> >
> > Monitoring -- http://sematext.com/spm
> >
> >
> >
> > On Fri, Jun 28, 2013 at 7:14 PM, Tim Vaillancourt
> > 
> > wrote:
> >
> >> Hey guys,
> >>
> >> This has to be a stupid question/I must be doing something wrong,
> >> but
> >>
> > after
> >
> >> frequent load testing with documentCache enabled under Solr 4.3.1
> >> with auto

Re: Set spellcheck field on query time?

2013-07-01 Thread Jan Høydahl
Check out http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.dictionary 
- you can define multiple dictionaries in the same handler, each with its own 
source field.
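A rough sketch of what that can look like in solrconfig.xml (the component name, dictionary names and fields are only examples):

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">english</str>
    <str name="field">spell_en</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
  </lst>
  <lst name="spellchecker">
    <str name="name">german</str>
    <str name="field">spell_de</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
  </lst>
</searchComponent>

The dictionary is then picked per request with spellcheck.dictionary=german (or english) at query time.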

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

1. juli 2013 kl. 11:34 skrev Timo Schmidt :

> Hello together,
>  
> we are currently working on a multilanguage single core setup.
>  
> During that I stumbled upon the question if it is possible to define 
> different sources for the spellcheck.
>  
> For now I only see the possibility to define different request handlers. Is 
> it somehow possible to set
> the source field for the DirectSolrSpellChecker on querytime?
>  
> Cheers
>  
> timo
>  
> 



Re: Index pdf files.

2013-07-01 Thread archit2112
Hi 

Thanks a lot. I did what you said. Now I'm getting the following error.

Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException:
java.util.regex.PatternSyntaxException: Dangling meta character '*' near
index 0



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Index-pdf-files-tp4074278p4074297.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Multiple groups of boolean queries in a single query.

2013-07-01 Thread samabhiK
My entire concern is to be able to make a single query to fetch all the types
of records. If I had to create three different cores for these different
types of data, I would have to make 3 calls to Solr to fetch the entire set
of data. And I will have approximately 15 such types in reality.

Also, in any given record, either the section 1 fields are filled in, or
section 2's, or section 3's. At no point will we have all these fields
populated in a single record. The only field that will have data for all records
is xyz_category, to allow us to partition the data set.

Any suggestions in writing a single query to fetch all the data we need will
be highly appreciated.

Thanks.
 



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Multiple-groups-of-boolean-queries-in-a-single-query-tp4074294p4074296.html
Sent from the Solr - User mailing list archive at Nabble.com.


Multiple groups of boolean queries in a single query.

2013-07-01 Thread samabhiK
Hello friends,

I have a schema which contains various types of records of three different
categories for ease of management and for making a single query to fetch all
the data. The fields are grouped into three different types of records. For
example:

fields type 1:





fields type 2:





fields type 3:




common partition field which identifies the category of the data record



What should I do to fetch all these records in the form: 

(+x_date:[2011-01-01T00:00:00Z TO *] +x_type:(1 OR 2 OR 3 OR 4)
+xyz_category:X) OR
(+y_date:[2012-06-01T00:00:00Z TO *] +y_name:sam~ +xyz_category:Y) OR
(+z_date:[2013-03-01T00:00:00Z TO *] +xyz_category:Z)

Can we construct a query like this? Or is it even possible?

Sam



 



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Multiple-groups-of-boolean-queries-in-a-single-query-tp4074294.html
Sent from the Solr - User mailing list archive at Nabble.com.


Sum as a Projection for Facet Queries

2013-07-01 Thread samarth s
Hi,

We need to find the sum of a field for each facet.query. We have
looked at StatsComponent, but
that supports only facet.field. Has anyone written a patch over
StatsComponent that supports this, along with some performance measures?

Is there any way we can do this using the sum() function query?

-- 
Regards,
Samarth


Set spellcheck field on query time?

2013-07-01 Thread Timo Schmidt
Hello together,

we are currently working on a multilanguage single core setup.

During that I stumbled upon the question whether it is possible to define different
sources for the spellcheck.

For now I only see the possibility of defining different request handlers. Is it
somehow possible to set
the source field for the DirectSolrSpellChecker at query time?

Cheers

timo





Re: Stemming query in Solr

2013-07-01 Thread snkar
Hi Erick,

Thanks for the reply.

Here is what the situation is:

Relevant portion of Solr Schema:

<field name="Content" type="text_general" indexed="false" stored="true" required="true"/>
<field name="ContentSearch" type="text_general" indexed="true" stored="false" multiValued="true"/>
<field name="ContentSearchStemming" type="text_stem" indexed="true" stored="false" multiValued="true"/>
<copyField source="Content" dest="ContentSearch"/>
<copyField source="Content" dest="ContentSearchStemming"/>

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="text_stem" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SnowballPorterFilterFactory"/>
  </analyzer>
</fieldType>

When I am indexing a document, the content gets stored as-is in the Content
field and gets copied over to ContentSearch and ContentSearchStemming for plain text
search and stemmed search respectively. So the ContentSearchStemming
field does store the stemmed/reduced form of the terms. I have checked this with Luke as well as
the Admin Schema Browser --> Term Info. In the Admin
Analysis screen, I have tested and found that if I index the text "burning", it
gets reduced to and stored as "burn". So far so good.
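
As a rough illustration of the kind of setup described (the field names are the
ones from this mail, but the types, attributes and analyzer chain are
assumptions, not the actual schema):

<field name="Content" type="string" indexed="false" stored="true"/>
<field name="ContentSearch" type="text_general" indexed="true" stored="false"/>
<field name="ContentSearchStemming" type="text_stem" indexed="true" stored="false"/>
<copyField source="Content" dest="ContentSearch"/>
<copyField source="Content" dest="ContentSearchStemming"/>

<fieldType name="text_stem" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

With a chain like this, "burning", "burned" and "burns" would all be reduced to
"burn" at index time.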

Now in the UI:
- Let's say the user puts in the term "burn" and checks the stemming option. The
expectation is that since the user has specified stemming, results should
be returned for the term "burn" as well as for all terms whose stem
is "burn", i.e. burning, burned, burns, etc.
- Let's say the user puts in the term "burning" and checks the stemming option.
The expectation is that since the user has specified stemming, results
should be returned for the term "burning" as well as for all terms whose stem
is "burn", i.e. burn, burned, burns, etc.
The query that gets submitted to Solr: q=ContentSearchStemming:burning
From Debug Info:
ContentSearchStemming:burning
ContentSearchStemming:burning
ContentSearchStemming:burn
ContentSearchStemming:burn
So, when the results are returned, I am only getting the hits highlighted with 
the term "burn", though the same document contains terms like burning and 
burns.

I thought that the stemming should work like this:
1. The stemming filter in the query analyzer chain reduces the input word to
its stem: burning --> burn.
2. The query component then matches those indexed terms whose stem matches the
stem of the input term: burns --> burn (matches), burning --> burn (matches).
The first point is happening. But it looks like the search is being executed as an
exact text match on the stem "burn"; hence burns or burned are not
getting returned.
Hope I was able to make myself clear.

 On Fri, 28 Jun 2013 05:59:37 -0700 Erick Erickson [via Lucene] 
 wrote  


 First, this is for the Java version, I hope it extends to C#. 

But in your configuration, when you're indexing the stemmer 
should be storing the reduced form in the index. Then, when 
searching, the search should be against the reduced term. 
To check this, try 
1> Using the Admin/Analysis page to see what gets stored 
 in your index and what your query is transformed to, to 
 ensure that you're getting what you expect. 

If you want to get in deeper to the details, try 
1> use, say, the TermsComponent or Admin/Schema Browser 
 or Luke to look in your index and see what's actually 
there. 
2> use &debug=query or Admin/Analysis to see what the query 
actually looks like. 

Both your use-cases should work fine just with reduction 
_unless_ the particular word you look for doesn't happen to 
trip the stemmer. By that I mean that since it's algorithmically 
based, there may be some edge cases that seem like they 
should be reduced that aren't. I don't know whether "fisherman" 
would reduce to "fish" for instance. 

So are you seeing things that really don't work as expected or 
are you just working from the docs? Because I really don't 
see why you wouldn't get what you want given your description. 

Best 
Erick 


On Fri, Jun 28, 2013 at 2:33 AM, snkar <[hidden email]> wrote: 

> We have a search system based on Solr using the Solrnet library in C# 
wh

Re: Index pdf files.

2013-07-01 Thread Shalin Shekhar Mangar
The Tika jars are not in your classpath. You need to add all the jars
inside the contrib/extraction/lib directory to your classpath.
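
One way to do that (a sketch -- the relative paths depend on where your core
lives, so adjust them to your layout) is to add lib directives to
solrconfig.xml:

<lib dir="../../../contrib/extraction/lib" regex=".*\.jar" />
<lib dir="../../../dist/" regex="solr-cell-\d.*\.jar" />
<lib dir="../../../dist/" regex="solr-dataimporthandler-.*\.jar" />

Alternatively, copy those jars into a lib directory that your Solr instance
already loads.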

On Mon, Jul 1, 2013 at 2:00 PM, archit2112  wrote:
> Hi, I'm new to Solr. I want to index pdf files using the Data Import Handler.
> I'm using Solr 4.3.0. I followed the steps given in this post
>
> http://lucene.472066.n3.nabble.com/indexing-with-DIH-and-with-problems-td3731129.html
>
> However, I get the following error -
>
> Full Import failed:java.lang.NoClassDefFoundError:
> org/apache/tika/parser/Parser
>
> Please help!
>
> Thanks
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Index-pdf-files-tp4074278.html
> Sent from the Solr - User mailing list archive at Nabble.com.



-- 
Regards,
Shalin Shekhar Mangar.


Index pdf files.

2013-07-01 Thread archit2112
Hi, I'm new to Solr. I want to index pdf files using the Data Import Handler.
I'm using Solr 4.3.0. I followed the steps given in this post

http://lucene.472066.n3.nabble.com/indexing-with-DIH-and-with-problems-td3731129.html

However, I get the following error -

Full Import failed:java.lang.NoClassDefFoundError:
org/apache/tika/parser/Parser

Please help!

Thanks



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Index-pdf-files-tp4074278.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: dataconfig to index ZIP Files

2013-07-01 Thread Bernd Fehling
Try setting dataSource="null" for your top-level entity and
use filename="\.zip$" as the filename selector.
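
For reference, a rough data-config.xml sketch along those lines (paths and the
target field name are made up, and whether Tika extracts anything useful from
the zip contents is a separate question):

<dataConfig>
  <dataSource type="BinFileDataSource" name="bin"/>
  <document>
    <entity name="files" processor="FileListEntityProcessor" rootEntity="false"
            dataSource="null" baseDir="/data/zips" fileName=".*\.zip$" recursive="true">
      <entity name="zipdoc" processor="TikaEntityProcessor" dataSource="bin"
              url="${files.fileAbsolutePath}" format="text">
        <field column="text" name="content"/>
      </entity>
    </entity>
  </document>
</dataConfig>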



On 28.06.2013 23:14, ericrs22 wrote:
> unfortunately not. I had tried that before with the logs saying:
> 
> Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException:
> java.util.regex.PatternSyntaxException: Dangling meta character '*' near
> index 0 
> 
> 
> With .*zip I get this:
> 
> 
> WARN
>  
> SimplePropertiesWriter
>  
> Unable to read: dataimport.properties
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/dataconfig-to-index-ZIP-Files-tp4073965p4074009.html
> Sent from the Solr - User mailing list archive at Nabble.com.
> 


Re: documentCache not used in 4.3.1?

2013-07-01 Thread Daniel Collins
We see similar results: we softCommit every 1s (trying to be as NRT
as we can), and we very rarely get any hits in our caches. As an
unscheduled test last week, we shut down indexing and saw about an 80%
hit rate in the caches (and average query time dropped from ~1s to 100ms!), so I
think we are in the same position as you.

I appreciate that with such a frequent soft commit the caches get
invalidated, but I was expecting cache warming to help, though it doesn't
appear to. We *don't* currently run a warming query; my impression of
NRT was that it was better not to, as otherwise you spend more time
warming the searcher and caches, and by the time you've done all that, the
searcher is invalidated anyway!
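
For reference, a minimal sketch of the cache and warming pieces being discussed
in solrconfig.xml (sizes and the warming query are illustrative, not anyone's
actual settings; documentCache has no regenerator, so it cannot really be
autowarmed even with autowarmCount set):

<documentCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="128"/>
<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="128"/>

<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">some representative query</str><str name="rows">10</str></lst>
  </arr>
</listener>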


On 30 June 2013 01:58, Tim Vaillancourt  wrote:

> That's a good idea, I'll try that next week.
>
> Thanks!
>
> Tim
>
>
> On 29/06/13 12:39 PM, Erick Erickson wrote:
>
>> Tim:
>>
>> Yeah, this doesn't make much sense to me either since,
>> as you say, you should be seeing some metrics upon
>> occasion. But do note that the underlying cache only gets
>> filled when getting documents to return in query results,
>> since there's no autowarming going on it may come and
>> go.
>>
>> But you can test this pretty quickly by lengthening your
>> autocommit interval or just not indexing anything
>> for a while, then run a bunch of queries and look at your
>> cache stats. That'll at least tell you whether it works at all.
>> You'll have to have hard commits turned off (or openSearcher
>> set to 'false') for that check too.
>>
>> Best
>> Erick
>>
>>
>> On Sat, Jun 29, 2013 at 2:48 PM, Vaillancourt, Tim wrote:
>>
>>  Yes, we are softCommit'ing every 1000ms, but that should be enough time
>>> to
>>> see metrics though, right? For example, I still get non-cumulative
>>> metrics
>>> from the other caches (which are also thrown away). I've also curl/sampled
>>> enough that I probably should have seen a value by now.
>>>
>>> If anyone else can reproduce this on 4.3.1 I will feel less crazy :).
>>>
>>> Cheers,
>>>
>>> Tim
>>>
>>> -----Original Message-----
>>> From: Erick Erickson [mailto:erickerickson@gmail.com]
>>> Sent: Saturday, June 29, 2013 10:13 AM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: documentCache not used in 4.3.1?
>>>
>>> It's especially weird that the hit ratio is so high and you're not seeing
>>> anything in the cache. Are you perhaps soft committing frequently? Soft
>>> commits throw away all the top-level caches including documentCache I
>>> think
>>>
>>> Erick
>>>
>>>
>>> On Fri, Jun 28, 2013 at 7:23 PM, Tim Vaillancourt wrote:
 Thanks Otis,

 Yeah I realized after sending my e-mail that doc cache does not warm,
 however I'm still lost on why there are no other metrics.

 Thanks!

 Tim


 On 28 June 2013 16:22, Otis Gospodnetic wrote:

  Hi Tim,
>
> Not sure about the zeros in 4.3.1, but in SPM we see all these
> numbers are non-0, though I haven't had the chance to confirm with
>
 Solr 4.3.1.
>>>
 Note that you can't really autowarm document cache...
>
> Otis
> --
> Solr & ElasticSearch Support -- http://sematext.com/
> Performance Monitoring -- http://sematext.com/spm
>
>
>
> On Fri, Jun 28, 2013 at 7:14 PM, Tim Vaillancourt wrote:
>
>> Hey guys,
>>
>> This has to be a stupid question/I must be doing something wrong, but after
>> frequent load testing with documentCache enabled under Solr 4.3.1
>> with autoWarmCount=150, I'm noticing that my documentCache metrics are
>> always zero for non-cumulative.
>>
>> At first I thought my commit rate is fast enough that I just never see
>> the non-cumulative result, but after 100s of samples I still always
>> get zero values.
>>
>> Here is the current output of my documentCache from Solr's admin for 1 core:
>>
>> "
>> - documentCache
>>   <http://localhost:8983/solr/#/channels_shard1_replica2/plugins/cache?entry=documentCache>
>>   - class: org.apache.solr.search.LRUCache
>>   - version: 1.0
>>   - description: LRU Cache(maxSize=512, initialSize=512, autowarmCount=150, regenerator=null)
>>   - src: $URL: https://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_4_3/solr/core/src/java/org/apache/solr/search/LRUCache.java $