StoredField

2013-03-17 Thread eksdev
Is there any way to get a BytesRef from a field originally stored as a String?

I am playing with Sorter to implement a StoredDocSorter, analogous to
NumericDocValuesSorter, but realised I do not need the BytesRef -> String
conversion just to compare fields (byte order would be just as good for sorting).

StoredDocument d1 = reader.document(docID1, fieldNamesSet);
String value1 = d1.get(fieldName);
BytesRef value2 = d1.getStringAsBytesValue(fieldName); // would love to have it

I need the String type in other places, so indexing as byte[] would be too much
hassle.

A String is internally stored as byte[], so is there any reason not to expose it
for StoredField (or any other type)?
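To make the ask concrete, a sketch of the sort comparison I have in mind, using
the hypothetical getStringAsBytesValue() from above (it does not exist;
BytesRef.compareTo already compares unsigned byte-by-byte, and for UTF-8 that
matches Unicode code point order, so no String decoding would be needed):

StoredDocument d2 = reader.document(docID2, fieldNamesSet);
BytesRef b1 = d1.getStringAsBytesValue(fieldName); // hypothetical accessor
BytesRef b2 = d2.getStringAsBytesValue(fieldName); // hypothetical accessor
int order = b1.compareTo(b2); // unsigned byte order, no byte[] -> String step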




Re: StoredField

2013-03-17 Thread eksdev
Shai, was that irony, or am I missing something big time?

I would like to spare the BytesRef -> String conversion, not introduce another
one back to BytesRef.

Simply, for sorting you do not need this byte[] -> String conversion; the byte
representation of the String is perfectly sortable…
 

On Mar 17, 2013, at 1:53 PM, Shai Erera ser...@gmail.com wrote:

 You can do new BytesRef(d1.get(fieldName)).
 
 Shai
 
 
 On Sun, Mar 17, 2013 at 2:43 PM, eksdev eks...@googlemail.com wrote:
 Is there any way to get a BytesRef from a field originally stored as a String?
 
 I am playing with Sorter to implement a StoredDocSorter, analogous to
 NumericDocValuesSorter, but realised I do not need the BytesRef -> String
 conversion just to compare fields (byte order would be just as good for sorting).
 
 StoredDocument d1 = reader.document(docID1, fieldNamesSet);
 String value1 = d1.get(fieldName);
 BytesRef value2 = d1.getStringAsBytesValue(fieldName); // would love to have it
 
 I need the String type in other places, so indexing as byte[] would be too much
 hassle.
 
 A String is internally stored as byte[], so is there any reason not to expose
 it for StoredField (or any other type)?
 
 
 



Re: StoredField

2013-03-17 Thread eksdev
Sure, there is a way to make anything -> byte[] ;)

it looks like this byte[] -> type conversion is done deep down, and this visitor
user-API already gets the correct types…

Maybe an idea would be to delay the byte[] -> type conversion to field access
time; I do not know what landmines would be on the road to doing it.

Use cases that require identity checks, or non-locale-specific sorting and the
like, would benefit from having raw, serialised representations without type
conversion… Anyhow, I could switch over to byte[] fields completely to do it…

Thanks for responding!  




On Mar 17, 2013, at 2:24 PM, Shai Erera ser...@gmail.com wrote:

 No no, not irony at all. I misunderstood the first time. You wrote "is there
 any way to get a BytesRef from a field originally stored as a String?", so I
 went with the first thing that came to mind :).
 
 But I understand the question now -- you say that since the String field is
 written as byte[] in the file, you want to read the byte[] as is, without
 translating it to a String, right?
 
 I don't know if it's possible. I'd try field.binaryValue(), though looking at 
 the impl it doesn't suggest it will do what you want.
 
 Shai
 
 
 On Sun, Mar 17, 2013 at 3:02 PM, eksdev eks...@googlemail.com wrote:
 Shai, was that irony, or am I missing something big time?
 
 I would like to spare the BytesRef -> String conversion, not introduce another
 one back to BytesRef.
 
 Simply, for sorting you do not need this byte[] -> String conversion; the byte
 representation of the String is perfectly sortable…
 
  
 
 On Mar 17, 2013, at 1:53 PM, Shai Erera ser...@gmail.com wrote:
 
 You can do new BytesRef(d1.get(fieldName)).
 
 Shai
 
 
 On Sun, Mar 17, 2013 at 2:43 PM, eksdev eks...@googlemail.com wrote:
 Is there any way to get a BytesRef from a field originally stored as a String?
 
 I am playing with Sorter to implement a StoredDocSorter, analogous to
 NumericDocValuesSorter, but realised I do not need the BytesRef -> String
 conversion just to compare fields (byte order would be just as good for sorting).
 
 StoredDocument d1 = reader.document(docID1, fieldNamesSet);
 String value1 = d1.get(fieldName);
 BytesRef value2 = d1.getStringAsBytesValue(fieldName); // would love to have it
 
 I need the String type in other places, so indexing as byte[] would be too much
 hassle.
 
 A String is internally stored as byte[], so is there any reason not to expose
 it for StoredField (or any other type)?
 
 
 
 
 



Re: StoredField

2013-03-17 Thread eksdev
Hi Adrien,
I cannot tell if such a thing would make it less or more robust, just thinking
aloud :)

I am thinking of it as a way to somehow postpone the byte[] -> type conversion
to the moment where it is really needed. Simply, keep the byte[] around as long
as possible.
*Theoretically*, this should improve gc() and the memory footprint for some
types of downstream processing. It all depends on how easy something like that
would be.

There is already a way to achieve this by using the binary field type… hmmm,
maybe some lucene.expert hack to make Lucene think every field is binary would
be simple and robust enough?
e.g. Visitor.transportOnlySerializedValuesWithoutTypeConversion()
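
For illustration, a minimal sketch of what works today, assuming the 4.1+
StoredFieldVisitor API and a field that was indexed as a binary StoredField
(for String fields the codec decodes to String before calling the visitor,
which is exactly the conversion I want to skip; the
transportOnlySerializedValuesWithoutTypeConversion() name above is made up):

import org.apache.lucene.index.FieldInfo;
import org.apache.lucene.index.StoredFieldVisitor;

class RawBytesVisitor extends StoredFieldVisitor {
  private final String field;
  byte[] raw; // the serialised value, untouched

  RawBytesVisitor(String field) { this.field = field; }

  @Override
  public void binaryField(FieldInfo fieldInfo, byte[] value) {
    raw = value; // no byte[] -> String conversion happens here
  }

  @Override
  public Status needsField(FieldInfo fieldInfo) {
    return field.equals(fieldInfo.name) ? Status.YES : Status.NO;
  }
}

// usage: RawBytesVisitor v = new RawBytesVisitor("myField"); reader.document(docID, v);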

-

By the way, the trick with TimSort in Sorter worked great. For 1.1 million short
documents, the time to sort an unsorted index on a handful of stored fields went
from 490 seconds down to 380.
Congrats and thanks for it! It also improved compression by 12% (very small, 4k
chunk size).

On Mar 17, 2013, at 5:26 PM, Adrien Grand jpou...@gmail.com wrote:

 Hi,
 
 On Sun, Mar 17, 2013 at 2:58 PM, eksdev eks...@googlemail.com wrote:
 Sure, there is a way to make anything -> byte[] ;)
 
 it looks like this byte[] -> type conversion is done deep down, and this
 visitor user-API already gets the correct types…
 
 Maybe an idea would be to delay the byte[] -> type conversion to field access
 time; I do not know what landmines would be on the road to doing it.
 
 Use cases that require identity checks, or non-locale-specific sorting and
 the like, would benefit from having raw, serialised representations without
 type conversion… Anyhow, I could switch over to byte[] fields completely to
 do it…
 
 I understand that it is frustrating to perform a String -> byte[]
 conversion if Lucene just did the opposite. But because it needs to
 perform one random seek per document (on a file which is often large),
 the stored fields API is much slower than a String -> UTF-8 bytes
 conversion, so I think we should keep the API robust rather than
 allowing for these kinds of optimizations?
 
 -- 
 Adrien
 
 





Re: [jira] [Commented] (LUCENE-3918) Port index sorter to trunk APIs

2013-03-06 Thread eksdev
humpf, do we actually need stored fields?

What is wrong with having a byte[] DV that stores all document fields, e.g.
Avro or something simpler to serialise all document fields into one byte[]?

I am definitely missing something about the DV vs. stored fields difference,
just not sure what.
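
A sketch of the idea, assuming 4.2's BinaryDocValuesField (serialize() and the
"_source" field name are placeholders of mine, not a real API):

import org.apache.lucene.document.BinaryDocValuesField;
import org.apache.lucene.document.Document;
import org.apache.lucene.util.BytesRef;

Document doc = new Document();
byte[] blob = serialize(allFieldValues); // e.g. Avro, or any simpler encoding
doc.add(new BinaryDocValuesField("_source", new BytesRef(blob)));
// at search time: read the binary doc value for a docID and deserialize it,
// instead of going through the stored fields file at all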



On Mar 6, 2013, at 8:18 PM, Andrzej Bialecki  (JIRA) j...@apache.org wrote:

 
    [ https://issues.apache.org/jira/browse/LUCENE-3918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13594997#comment-13594997 ]
 
 Andrzej Bialecki commented on LUCENE-3918:
 -------------------------------------------
 
 bq. I still don't get why someone would use stored fields rather than doc 
 values (either binary, sorted or numeric) to sort his index. I think it's 
 important to make users understand that stored fields are only useful to 
 display results?
 
 This is a legacy of the original usage of this tool in Nutch - indexes would 
 use a PageRank value as a document boost, and that was the value to be used 
 for sorting - but since the doc boost is not recoverable from an existing 
 index the value itself was stored in a stored field.
 
 And definitely DV didn't exist yet at that time :)
 
 Port index sorter to trunk APIs
 -------------------------------
 
                 Key: LUCENE-3918
                 URL: https://issues.apache.org/jira/browse/LUCENE-3918
             Project: Lucene - Core
          Issue Type: Task
          Components: modules/other
    Affects Versions: 4.0-ALPHA
            Reporter: Robert Muir
             Fix For: 4.2, 5.0
 
         Attachments: LUCENE-3918.patch, LUCENE-3918.patch, LUCENE-3918.patch,
                      LUCENE-3918.patch, LUCENE-3918.patch, LUCENE-3918.patch,
                      LUCENE-3918.patch, LUCENE-3918.patch, LUCENE-3918.patch,
                      LUCENE-3918.patch, LUCENE-3918.patch, LUCENE-3918.patch,
                      LUCENE-3918.patch
 
 
 LUCENE-2482 added an IndexSorter to 3.x, but we need to port this
 functionality to 4.0 apis.
 
 
 





Win7 64bit, jvm 7 and MMAP OOM

2013-01-22 Thread eksdev
Just to share some experience if someone hits the same problem.  

We had huge problems on Win7 64bit, JVM 64bit 1.7.0_07 (a few days old trunk
version, 5.0, default codec), Solr under Tomcat with the thread queue limited
to 20. NRTCaching and MMAP have the same problems (no updates, just search).
The unmap hack was on.

Problem:
After hitting it for 15 minutes with 10 search threads, gc() did not manage to
catch up and free enough memory, and the server repeatedly spiralled to OOM
(giving more max heap did not help either). The server is running on 8 cores,
and 10 client threads are normally not an issue.

Observations:
- Tweaking JVM memory and gc() options did not help at all.
- Exactly the same configuration and tests on 3 Linux flavours had absolutely
no problems.
- Win using FSDirectory works slower, but stable.
- When the OOM spiralling happens, the major culprits by occupied memory and
number of instances are:
o.a.l.util.WeakIdentityMap$IdentityWeakReference
java.util.concurrent.ConcurrentHashMap$HashEntry
java.util.concurrent.ConcurrentHashMap$HashEntry[]
- If search requests are paused for a really long time (5 to 10 minutes!),
these references do eventually get released.
--


I know java+MMAP on Win platforms has problems (slowly releasing mapped
regions), but I did not expect it to be that bad, to the point of being useless.
It is not an itch currently, as all our production is on Linux, but if someone
has an idea how to work around it, we would be glad to try it.
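
For anyone wanting to reproduce the non-mmap fallback, a sketch assuming the
4.x File-based Directory constructors (SimpleFSDirectory is one non-mmap
choice, which matches the "slower, but stable" observation above):

import java.io.File;
import java.io.IOException;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.MMapDirectory;
import org.apache.lucene.store.SimpleFSDirectory;

static Directory openIndexDir(File path, boolean avoidMmap) throws IOException {
  // avoiding mmap means the WeakIdentityMap unmap bookkeeping listed above
  // never comes into play
  return avoidMmap ? new SimpleFSDirectory(path) : new MMapDirectory(path);
}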



Re: [jira] [Commented] (SOLR-4032) Unable to replicate between nodes ( read past EOF)

2012-11-27 Thread eksdev
Thanks, Mark!


On Nov 27, 2012, at 8:43 PM, Mark Miller (JIRA) j...@apache.org wrote:

 
    [ https://issues.apache.org/jira/browse/SOLR-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504875#comment-13504875 ]
 
 Mark Miller commented on SOLR-4032:
 -----------------------------------
 
 I'll try and make a fix soon.
 
 Unable to replicate between nodes ( read past EOF)
 --------------------------------------------------
 
                 Key: SOLR-4032
                 URL: https://issues.apache.org/jira/browse/SOLR-4032
             Project: Solr
          Issue Type: Bug
          Components: SolrCloud
    Affects Versions: 4.0
         Environment: 5.0-SNAPSHOT 1366361:1404534M - markus - 2012-11-01 12:37:38
                      Debian Squeeze, Tomcat 6, Sun Java 6, 10 nodes, 10 shards, rep. factor 2.
            Reporter: Markus Jelsma
            Assignee: Mark Miller
             Fix For: 4.1, 5.0
 
 
 Please see: 
 http://lucene.472066.n3.nabble.com/trunk-is-unable-to-replicate-between-nodes-Unable-to-download-completely-td4017049.html
  and 
 http://lucene.472066.n3.nabble.com/Possible-memory-leak-in-recovery-td4017833.html
 
 
 





Re: Compressed stored fields and multiGet(sorted luceneId[])?

2012-11-08 Thread eksdev

On Nov 8, 2012, at 11:30 AM, Robert Muir rcm...@gmail.com wrote:

Thanks everybody for responding, and even more thanks for the great project!


 Why are you retrieving thousands of stored fields?


I do not think it is all that rare that people actually do something with the
information other than just display summaries.
Clustering in Solr does exactly that, and online record linkage follows exactly
the same pattern.
 
A pattern of fetching thousands of candidates and running some heavy processing
on them is surely not typical web search engine usage, but philosophically, a
model of:
a) search data
b) do something with it
c) deliver
is not that strange?

You say b) should not be done using stored fields; ok, I trust you, but going
to a database/nosql/anything else is even slower. What approach would you
recommend?


 the probability of two documents of the same results page being in the same
 chunk is very low.

Adrien, Robert, this is 100% correct, no objection there.
In this particular case we are exploiting locality of reference heavily. We
simply sort the data and reindex from time to time. You have to be lucky to be
able to sort the documents, but we do not use Lucene for big chunks of text,
rather for almost fully structured data, and we know how to sort this data to
preserve locality of reference… Also a bit unusual, but I do not think an all
that rare scenario.
Sorting data (where possible) was a great optimisation tip for many 
applications, even before compression.


 really you should roll your own codec for this and specialise.

Yes, already started thinking about it, but we will first try playing with the
chunk size to see if we can achieve the goal without our own codec…






Compressed stored fields and multiGet(sorted luceneId[])?

2012-11-07 Thread eksdev
Just a theoretical question: would it make sense to add some sort of
StoredDocument[] bulkGet(int[] docIds) to fetch multiple stored documents in
one go?

The reasoning behind it is that now, with compressed blocks, random access gets
more expensive, and in some cases a user needs to fetch many documents in one
go. If it happens that several documents come from one block, it is a win. I
would also assume that even without compression, bulk access on sorted docIds
could be a win (sequential access)?

Does that make sense, and is it doable? Or, even worse, does it already exist? :)
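
For illustration, a user-side sketch of the hypothetical bulkGet (StoredDocument
as in trunk at the time; in released 4.x this would be Document). Sorting the
docIds at least turns random seeks into mostly forward ones, though a real win
from shared chunks would need cooperation from the codec:

import java.io.IOException;
import java.util.Arrays;
import org.apache.lucene.document.StoredDocument; // trunk API at the time
import org.apache.lucene.index.IndexReader;

StoredDocument[] bulkGet(IndexReader reader, int[] docIds) throws IOException {
  int[] sorted = docIds.clone();
  Arrays.sort(sorted); // ascending docIDs: neighbours may share a compressed chunk
  StoredDocument[] docs = new StoredDocument[sorted.length];
  for (int i = 0; i < sorted.length; i++) {
    docs[i] = reader.document(sorted[i]);
  }
  return docs;
}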

By the way, I am impressed by how well compression does, even on really short
stored documents: at approx. 150 bytes per document we observe a 35% reduction.
Fetching 1000 short documents on a fully cached index is observably slower (2-3
times), but as soon as your memory gets low, compression wins quickly. I did
not test it thoroughly, but it looks good so far. Great job!





Re: java.lang.NoSuchMethodError: org.apache.lucene.search.Scorer.freq()

2012-11-02 Thread eksdev
I *think* I did. Will try to build again...


On Nov 2, 2012, at 1:12 PM, Simon Willnauer simon.willna...@gmail.com wrote:

 did you clean your checkout?
 
 simon
 
 On Fri, Nov 2, 2012 at 1:10 PM, eksdev eks...@googlemail.com wrote:
 debugQuery=true on a simple TermQuery (/trunk version from yesterday)
 throws an exception on explain().
 
 It seems some of the scorers do not play nicely with explain()?
 
 
 java.lang.NoSuchMethodError: org.apache.lucene.search.Scorer.freq()F
     at org.apache.lucene.search.TermQuery$TermWeight.explain(TermQuery.java:119)
     at org.apache.lucene.search.IndexSearcher.explain(IndexSearcher.java:636)
     at org.apache.lucene.search.IndexSearcher.explain(IndexSearcher.java:616)
     at org.apache.solr.search.SolrIndexSearcher.explain(SolrIndexSearcher.java:1949)
     at org.apache.solr.util.SolrPluginUtils.getExplanations(SolrPluginUtils.java:352)
     ...
 
 
 
 





Re: java.lang.NoSuchMethodError: org.apache.lucene.search.Scorer.freq()

2012-11-02 Thread eksdev
clean did it.
Never use Eclipse if you want to know exactly what you did; the command line
always works.

sorry for the noise

On Nov 2, 2012, at 1:15 PM, eksdev eks...@googlemail.com wrote:

 I *think* I did. Will try to build again...
 
 
 On Nov 2, 2012, at 1:12 PM, Simon Willnauer simon.willna...@gmail.com wrote:
 
 did you clean your checkout?
 
 simon
 
 On Fri, Nov 2, 2012 at 1:10 PM, eksdev eks...@googlemail.com wrote:
 debugQuery=true on a simple TermQuery (/trunk version from yesterday)
 throws an exception on explain().
 
 It seems some of the scorers do not play nicely with explain()?
 
 
 java.lang.NoSuchMethodError: org.apache.lucene.search.Scorer.freq()F
     at org.apache.lucene.search.TermQuery$TermWeight.explain(TermQuery.java:119)
     at org.apache.lucene.search.IndexSearcher.explain(IndexSearcher.java:636)
     at org.apache.lucene.search.IndexSearcher.explain(IndexSearcher.java:616)
     at org.apache.solr.search.SolrIndexSearcher.explain(SolrIndexSearcher.java:1949)
     at org.apache.solr.util.SolrPluginUtils.getExplanations(SolrPluginUtils.java:352)
     ...
 
 
 
 
 

