RE: Our Optimize Suggestions on lucene 3.5

2015-01-30 Thread Uwe Schindler
Sorry Robert – you’re right,

 

I had the impression that we had already changed that. In fact, the WeakHashMap is
needed because multiple readers (especially Slow ones) can share the same
uninverted fields. In an ideal world, we would change this completely: remove
FieldCacheImpl entirely and let the field maps live directly on the
UninvertingReader as regular member fields. The only problem with this is that if
you have multiple UninvertingReaders, each of them has its own separate uninverted
instances. But doing that is already a bug anyway.
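
A rough sketch of that idea (hypothetical code, not the actual UninvertingReader
implementation; the class and method names are made up): the wrapper keeps its
uninverted fields as plain member fields, so their lifetime is exactly the lifetime
of that one reader instance and no shared cache map, weak or otherwise, is needed.
The drawback mentioned above is visible here too: two wrappers around the same
reader would each uninvert the field again.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.IndexReader;

class MemberFieldUninverter {
  private final IndexReader in;                      // the wrapped reader
  private final Map<String, Object> uninverted =     // field name -> uninverted data
      new HashMap<String, Object>();

  MemberFieldUninverter(IndexReader in) {
    this.in = in;
  }

  // the uninverted data lives and dies with this instance; nothing global to leak
  synchronized Object getUninverted(String field) throws IOException {
    Object v = uninverted.get(field);
    if (v == null) {
      v = uninvertField(in, field);                  // expensive one-time pass over the terms
      uninverted.put(field, v);
    }
    return v;
  }

  private Object uninvertField(IndexReader reader, String field) throws IOException {
    return new Object();                             // placeholder for the real uninversion
  }
}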

 

Uwe

 

-

Uwe Schindler

H.-H.-Meier-Allee 63, D-28213 Bremen

http://www.thetaphi.de/

eMail: u...@thetaphi.de

 

From: Robert Muir [mailto:rcm...@gmail.com] 
Sent: Friday, January 30, 2015 5:04 AM
To: dev@lucene.apache.org
Subject: Re: Our Optimize Suggestions on lucene 3.5

 

I am not sure this is the case. Actually, FieldCacheImpl still works as before
and still has a weak hash map.

However, I think the weak map is unnecessary. Reader close listeners already
ensure purging from the map, so I don't think the weak map serves any purpose
today. The only possible advantage it has is to allow you to GC field caches
when you are already leaking readers... it could just be a regular map, IMO.
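
A minimal sketch of that alternative, assuming the Lucene 4.x/5.x
IndexReader.addReaderClosedListener API (the real FieldCacheImpl is keyed on the
segment core and uses core-closed listeners; the class below is made up and only
shows the purging idea): a plain HashMap plus an explicit purge when the reader is
closed, instead of relying on weak references and the GC.

import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.IndexReader;

class ReaderKeyedCache {
  // a plain map is enough: entries are removed explicitly when the reader closes
  private final Map<Object, Object> cache = new HashMap<Object, Object>();

  synchronized Object get(IndexReader reader) {
    final Object key = reader.getCoreCacheKey();
    Object value = cache.get(key);
    if (value == null) {
      value = uninvert(reader);                      // expensive load, done once per reader
      cache.put(key, value);
      reader.addReaderClosedListener(new IndexReader.ReaderClosedListener() {
        @Override
        public void onClose(IndexReader closed) {
          // purge eagerly on close instead of waiting for the GC
          synchronized (ReaderKeyedCache.this) {
            cache.remove(key);
          }
        }
      });
    }
    return value;
  }

  private Object uninvert(IndexReader reader) {
    return new Object();                             // placeholder for the real uninverted data
  }
}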

 

On Thu, Jan 29, 2015 at 9:35 AM, Uwe Schindler u...@thetaphi.de wrote:

Hi,

parts of your suggestions are already done in Lucene 4+. For one part I can 
tell you:


WeakHashMap / HashMap / synchronized problems


1. FieldCacheImpl uses a WeakHashMap to manage the field value cache; it has a
memory-leak bug.

2. SolrInputDocument uses a lot of HashMap/LinkedHashMap instances for its fields,
which wastes a lot of memory.

3. AttributeSource uses a WeakHashMap to cache the implementation classes, with a
global synchronized block that reduces performance.

4. AttributeSource is a base class; NumericField extends AttributeSource, but they
create a lot of HashMaps that NumericField never uses.

5. Because of all this, the JVM GC carries a lot of burden for the never-used
HashMaps.

All of these items no longer apply to current Lucene:

1.   FieldCache is gone and is no longer supported in Lucene 5. You should use the
new DocValues index format for that (column-based storage, optimized for sorting
and numerics). You can still use Lucene's UninvertingReader, but this one has no
weak maps anymore because it is not a cache. (A small DocValues sketch follows
after this list.)

2.   No idea about that one - it's unrelated to Lucene.

3.   AttributeSource no longer uses this; since Lucene 4.8 it uses Java 7's
java.lang.ClassValue to attach the implementation class to the interface. No
concurrency problems anymore. It also uses MethodHandles to invoke the attribute
classes. (A ClassValue sketch follows a bit further below.)

4.   NumericField no longer exists, and the base class does not use
AttributeSource. All field instances now automatically reuse the inner
TokenStream instances across fields, too!

5.   See above.
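
A minimal Lucene 5.x-style sketch of point 1 (the field name "age", the
RAMDirectory and everything else here are only for illustration): the sort value
is stored as a DocValues column at index time and sorted on at search time, with
no FieldCache and no uninverting involved.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class DocValuesSortExample {
  public static void main(String[] args) throws Exception {
    Directory dir = new RAMDirectory();
    IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));

    Document doc = new Document();
    doc.add(new StringField("id", "1", Field.Store.YES));
    doc.add(new NumericDocValuesField("age", 32));   // column-stride storage, built for sorting
    writer.addDocument(doc);
    writer.close();

    DirectoryReader reader = DirectoryReader.open(dir);
    IndexSearcher searcher = new IndexSearcher(reader);
    // sorts directly on the DocValues column
    TopDocs hits = searcher.search(new MatchAllDocsQuery(), 10,
        new Sort(new SortField("age", SortField.Type.LONG)));
    System.out.println("hits: " + hits.totalHits);
    reader.close();
    dir.close();
  }
}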

In addition, Lucene has much better memory use, because terms are no longer
UTF-16 strings but live in large shared byte arrays. So a lot of those other
“optimizations” are handled in a different way in Lucene 4 and Lucene 5 (coming
out in the next few days).
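
And a small plain-Java sketch of the java.lang.ClassValue pattern mentioned in
point 3 above (Java 7+; this is not Lucene's actual AttributeSource code, just the
mechanism): one lazily computed value per Class, cached by the JVM without any
global synchronized WeakHashMap.

public class ClassValueCacheExample {

  // computeValue() runs at most once per Class; the result is cached per class
  // in a way that does not block concurrent readers or prevent class unloading.
  private static final ClassValue<String> IMPL_NAME = new ClassValue<String>() {
    @Override
    protected String computeValue(Class<?> type) {
      return type.getName() + "Impl";   // stand-in for resolving an implementation class
    }
  };

  public static void main(String[] args) {
    System.out.println(IMPL_NAME.get(CharSequence.class));  // computed on first access
    System.out.println(IMPL_NAME.get(CharSequence.class));  // served from the per-class cache
  }
}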

Uwe

-

Uwe Schindler

H.-H.-Meier-Allee 63, D-28213 Bremen

http://www.thetaphi.de/

eMail: u...@thetaphi.de

 

From: yannianmu(母延年) [mailto:yannia...@tencent.com] 
Sent: Thursday, January 29, 2015 12:59 PM
To: general; dev; commits
Subject: Our Optimize Suggestions on lucene 3.5

 

 

 

Dear Lucene dev,

We are from the Hermes team. Hermes is a project based on Lucene 3.5 and Solr 3.5.

Hermes processes 100 billion documents per day, 2000 billion documents in total
(two months). Nowadays our single-cluster index size is over 200 TB, and the total
size is 600 TB. We use Lucene to speed up our big-data warehouse and to reduce the
analysis response time, for example filters like age=32 and keywords like 'lucene',
or operations like count, sum, order by and group by.

Hermes can filter data out of 1000 billion rows in 1 second. An order by over
10 billion rows takes 10 s, a group by over 10 billion rows takes 15 s, and
sum/avg/max/min statistics over 10 billion rows take 30 s.

For those purposes we made lots of improvements on top of Lucene and Solr. Lucene
has changed so much since version 4.10 that we do not want to commit our code back
to Lucene; we only want to introduce our improvements based on Lucene 3.5 and
explain how Hermes can process 100 billion documents per day on 32 physical
machines. We think it may be helpful for people in a similar situation.

 

 


 


First level index (.tii): loading on demand

Original:

1. The .tii file is loaded into RAM by TermInfosReaderIndex.

2. That can be quite slow the first time an index is opened.

3. The index has to be kept open persistently; once opened, it is never closed.

4. This limits the number of indexes; when we have thousands of indexes, it
becomes impossible.

Our improvements:

1. Loading on demand: not all fields need to be loaded into memory.

2. We modified the method getIndexOffset

Re: Our Optimize Suggestions on lucene 3.5

2015-01-30 Thread Robert Muir
I think this is all fine, because things are keyed on the core reader and
there are already core listeners installed to purge entries when the ref
count for a core drops to zero.

Honestly, if you change the map to a regular one, all tests pass.

On Fri, Jan 30, 2015 at 5:37 AM, Uwe Schindler u...@thetaphi.de wrote:
 Sorry Robert – you’re right,



 I had the impression that we had already changed that. In fact, the WeakHashMap
 is needed because multiple readers (especially Slow ones) can share the same
 uninverted fields. In an ideal world, we would change this completely: remove
 FieldCacheImpl entirely and let the field maps live directly on the
 UninvertingReader as regular member fields. The only problem with this is that
 if you have multiple UninvertingReaders, each of them has its own separate
 uninverted instances. But doing that is already a bug anyway.



 Uwe



 -

 Uwe Schindler

 H.-H.-Meier-Allee 63, D-28213 Bremen

 http://www.thetaphi.de

 eMail: u...@thetaphi.de




RE: Re: Our Optimize Suggestions on lucene 3.5

2015-01-30 Thread Uwe Schindler
Hi yannianmu,

What you propose here is a so-called "soft reference cache". That is something
completely different. In fact, FieldCache was never a "cache", because it was
never able to evict entries under memory pressure. The weak map has a different
reason (please note: it uses "weak keys", not "weak values" as you do in your
implementation). Weak maps are not useful for caches at all; they are useful to
decouple object instances from each other.

The weak map in FieldCacheImpl is there to prevent memory leaks if you open
multiple IndexReaders on the same segments. If the last one is closed and garbage
collected, the corresponding uninverted field should disappear, and this works
correctly. But this does not mean that uninverted fields are removed under memory
pressure: once loaded, the uninverted data stays alive until all referring readers
are closed – this is the idea behind the design, so there is no memory leak! If you
want a cache that discards its entries under memory pressure, implement your own
field "cache" (in fact a real "cache", like you did).

Uwe

P.S.: FieldCache was a bad name, because it was not a "cache". This is why the
functionality is exposed as "UninvertingReader" now.
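
To make the distinction concrete, a small illustrative sketch (the names are made
up): the first map has weak keys, so an entry lives exactly as long as something
else still references the reader used as the key; the second has soft values, so
the JVM may drop a value under memory pressure even while the reader is still
open. Only the second one behaves like a cache.

import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;
import java.util.WeakHashMap;

import org.apache.lucene.index.IndexReader;

class WeakKeysVersusSoftValues {
  // FieldCacheImpl-style: weak keys couple the lifetime of the value to the reader,
  // not to available memory. Nothing is evicted while the reader is still referenced.
  final Map<IndexReader, Object> weakKeyed = new WeakHashMap<IndexReader, Object>();

  // cache-style: strong keys, soft values; the GC may clear a SoftReference when
  // memory gets tight, so get() can return null at any time and the caller reloads.
  final Map<IndexReader, SoftReference<Object>> softValued =
      new HashMap<IndexReader, SoftReference<Object>>();

  Object getSoft(IndexReader reader) {
    SoftReference<Object> ref = softValued.get(reader);
    return ref == null ? null : ref.get();           // null means: uninvert again
  }
}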

-

Uwe Schindler

H.-H.-Meier-Allee 63, D-28213 Bremen

http://www.thetaphi.de/

eMail: u...@thetaphi.de

 

From: yannianmu(母延年) [mailto:yannia...@tencent.com] 
Sent: Friday, January 30, 2015 5:47 AM
To: Robert Muir; dev
Subject: Re: Re: Our Optimize Suggestions on lucene 3.5

 

WeakHashMap may cause a memory-leak problem.

We use SoftReference instead, like this:

 

 

  // (reconstructed generics; Entry here is FieldCacheImpl's cache key class, and
  // the outer key Object is the reader's core cache key)
  public static class SoftLinkMap {
    private static int SORT_CACHE_SIZE = 1024;
    private static float LOADFACTOR = 0.75f;
    // access-ordered LRU map; values are SoftReferences, so the GC may also
    // reclaim them under memory pressure
    final Map<Object, SoftReference<Map<Entry, Object>>> readerCache_lru =
        new LinkedHashMap<Object, SoftReference<Map<Entry, Object>>>(
            (int) Math.ceil(SORT_CACHE_SIZE / LOADFACTOR) + 1, LOADFACTOR, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<Object, SoftReference<Map<Entry, Object>>> eldest) {
        return size() > SORT_CACHE_SIZE;
      }
    };

    public void remove(Object key) {
      readerCache_lru.remove(key);
    }

    public Map<Entry, Object> get(Object key) {
      SoftReference<Map<Entry, Object>> w = readerCache_lru.get(key);
      if (w == null) {
        return null;
      }
      return w.get();
    }

    public void put(Object key, Map<Entry, Object> value) {
      readerCache_lru.put(key, new SoftReference<Map<Entry, Object>>(value));
    }

    public Set<java.util.Map.Entry<Object, Map<Entry, Object>>> entrySet() {
      HashMap<Object, Map<Entry, Object>> rtn = new HashMap<Object, Map<Entry, Object>>();
      for (java.util.Map.Entry<Object, SoftReference<Map<Entry, Object>>> e : readerCache_lru.entrySet()) {
        Map<Entry, Object> v = e.getValue().get();
        if (v != null) {
          rtn.put(e.getKey(), v);
        }
      }
      return rtn.entrySet();
    }
  }

  final SoftLinkMap readerCache = new SoftLinkMap();
  // final Map<Object, Map<Entry, Object>> readerCache = new WeakHashMap<Object, Map<Entry, Object>>();



 

  _  

yannianmu(母延年)

 


RE: Our Optimize Suggestions on lucene 3.5

2015-01-29 Thread Uwe Schindler
Hi,

parts of your suggestions are already done in Lucene 4+. For one part I can 
tell you:


WeakHashMap / HashMap / synchronized problems


1. FieldCacheImpl uses a WeakHashMap to manage the field value cache; it has a
memory-leak bug.

2. SolrInputDocument uses a lot of HashMap/LinkedHashMap instances for its fields,
which wastes a lot of memory.

3. AttributeSource uses a WeakHashMap to cache the implementation classes, with a
global synchronized block that reduces performance.

4. AttributeSource is a base class; NumericField extends AttributeSource, but they
create a lot of HashMaps that NumericField never uses.

5. Because of all this, the JVM GC carries a lot of burden for the never-used
HashMaps.

All of these items no longer apply to current Lucene:

1.   FieldCache is gone and is no longer supported in Lucene 5. You should use the
new DocValues index format for that (column-based storage, optimized for sorting
and numerics). You can still use Lucene's UninvertingReader, but this one has no
weak maps anymore because it is not a cache.

2.   No idea about that one - it's unrelated to Lucene.

3.   AttributeSource no longer uses this; since Lucene 4.8 it uses Java 7's
java.lang.ClassValue to attach the implementation class to the interface. No
concurrency problems anymore. It also uses MethodHandles to invoke the attribute
classes.

4.   NumericField no longer exists, and the base class does not use
AttributeSource. All field instances now automatically reuse the inner
TokenStream instances across fields, too!

5.   See above.

In addition, Lucene has much better memory use, because terms are no longer
UTF-16 strings but live in large shared byte arrays. So a lot of those other
“optimizations” are handled in a different way in Lucene 4 and Lucene 5 (coming
out in the next few days).

Uwe

-

Uwe Schindler

H.-H.-Meier-Allee 63, D-28213 Bremen

http://www.thetaphi.de/

eMail: u...@thetaphi.de

 

From: yannianmu(母延年) [mailto:yannia...@tencent.com] 
Sent: Thursday, January 29, 2015 12:59 PM
To: general; dev; commits
Subject: Our Optimize Suggestions on lucene 3.5

 

 

 

Dear Lucene dev,

We are from the Hermes team. Hermes is a project based on Lucene 3.5 and Solr 3.5.

Hermes processes 100 billion documents per day, 2000 billion documents in total
(two months). Nowadays our single-cluster index size is over 200 TB, and the total
size is 600 TB. We use Lucene to speed up our big-data warehouse and to reduce the
analysis response time, for example filters like age=32 and keywords like 'lucene',
or operations like count, sum, order by and group by.

Hermes can filter data out of 1000 billion rows in 1 second. An order by over
10 billion rows takes 10 s, a group by over 10 billion rows takes 15 s, and
sum/avg/max/min statistics over 10 billion rows take 30 s.

For those purposes we made lots of improvements on top of Lucene and Solr. Lucene
has changed so much since version 4.10 that we do not want to commit our code back
to Lucene; we only want to introduce our improvements based on Lucene 3.5 and
explain how Hermes can process 100 billion documents per day on 32 physical
machines. We think it may be helpful for people in a similar situation.

 

 


 


First level index (.tii): loading on demand

Original:

1. The .tii file is loaded into RAM by TermInfosReaderIndex.

2. That can be quite slow the first time an index is opened.

3. The index has to be kept open persistently; once opened, it is never closed.

4. This limits the number of indexes; when we have thousands of indexes, it
becomes impossible.

Our improvements:

1. Loading on demand: not all fields need to be loaded into memory.

2. We modified the method getIndexOffset (a binary search) to work on disk instead
of in memory, and we use an LRU cache to speed it up.

3. Doing getIndexOffset on disk saves a lot of memory and reduces the time needed
to open an index.

4. Hermes often opens different indexes for different businesses; when an index is
not used often, we close it (managed by an LRU; see the sketch after the next
list).

5. This way one physical machine can store over 10 indexes.

Solve the problem:

1. Hermes needs to store over 1000 billion documents; we do not have enough memory
to hold all the .tii files.

2. We have over 10 indexes; if all of them were open, that would waste a lot of
file descriptors, which the file system will not allow.
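
A rough sketch of how points 4 and 5 of "Our improvements" could look (illustrative
only, not actual Hermes code; the limit and all names are made up): an
access-ordered LinkedHashMap holds the currently open readers, and the least
recently used one is closed when the limit is exceeded, which frees its file
descriptors and its loaded .tii data.

import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.lucene.index.DirectoryReader;

class LruReaderPool {
  private static final int MAX_OPEN = 128;           // assumed limit on open indexes

  private final Map<String, DirectoryReader> open =
      new LinkedHashMap<String, DirectoryReader>(MAX_OPEN, 0.75f, true) {
        @Override
        protected boolean removeEldestEntry(Map.Entry<String, DirectoryReader> eldest) {
          if (size() > MAX_OPEN) {
            try {
              eldest.getValue().close();             // evict: close the least recently used index
            } catch (IOException e) {
              // ignored in this sketch
            }
            return true;
          }
          return false;
        }
      };

  synchronized DirectoryReader get(String indexPath, IndexOpener opener) throws IOException {
    DirectoryReader r = open.get(indexPath);         // access-ordered: marks it recently used
    if (r == null) {
      r = opener.open(indexPath);                    // open on demand
      open.put(indexPath, r);
    }
    return r;
  }

  /** Hypothetical hook for however the index is actually opened (local disk, HDFS, ...). */
  interface IndexOpener {
    DirectoryReader open(String indexPath) throws IOException;
  }
}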

 


Build index on HDFS

1. We modified the Lucene 3.5 code in 2013 so that we can build indexes directly on
HDFS (Lucene has supported HDFS since 4.0).

2. All the offline data is built by MapReduce on HDFS.

3. We moved all the real-time indexes from local disk to HDFS.

4. We can ignore disk failures because the index is on HDFS.

5. We can move a process from one machine to another machine on HDFS.

6. We can quickly recover an index when a disk failure happens.

7. We do not need to recover data when a machine breaks (the index is so big that
moving it would take many hours); the process can quickly move to another machine
via the ZooKeeper heartbeat.

8. As we all know, an index on HDFS is slower than on the local file system, but
why? On the local file system the OS makes so

Re: Our Optimize Suggestions on lucene 3.5

2015-01-29 Thread Robert Muir
I am not sure this is the case. Actually, FieldCacheImpl still works as
before and still has a weak hash map.

However, I think the weak map is unnecessary. Reader close listeners
already ensure purging from the map, so I don't think the weak map serves
any purpose today. The only possible advantage it has is to allow you to GC
field caches when you are already leaking readers... it could just be a
regular map, IMO.


Re: Re: Our Optimize Suggestions on lucene 3.5

2015-01-29 Thread 母延年
WeakHashMap may cause a memory-leak problem.

We use SoftReference instead, like this:


  // (reconstructed generics; Entry here is FieldCacheImpl's cache key class, and
  // the outer key Object is the reader's core cache key)
  public static class SoftLinkMap {
    private static int SORT_CACHE_SIZE = 1024;
    private static float LOADFACTOR = 0.75f;
    // access-ordered LRU map; values are SoftReferences, so the GC may also
    // reclaim them under memory pressure
    final Map<Object, SoftReference<Map<Entry, Object>>> readerCache_lru =
        new LinkedHashMap<Object, SoftReference<Map<Entry, Object>>>(
            (int) Math.ceil(SORT_CACHE_SIZE / LOADFACTOR) + 1, LOADFACTOR, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<Object, SoftReference<Map<Entry, Object>>> eldest) {
        return size() > SORT_CACHE_SIZE;
      }
    };

    public void remove(Object key) {
      readerCache_lru.remove(key);
    }

    public Map<Entry, Object> get(Object key) {
      SoftReference<Map<Entry, Object>> w = readerCache_lru.get(key);
      if (w == null) {
        return null;
      }
      return w.get();
    }

    public void put(Object key, Map<Entry, Object> value) {
      readerCache_lru.put(key, new SoftReference<Map<Entry, Object>>(value));
    }

    public Set<java.util.Map.Entry<Object, Map<Entry, Object>>> entrySet() {
      HashMap<Object, Map<Entry, Object>> rtn = new HashMap<Object, Map<Entry, Object>>();
      for (java.util.Map.Entry<Object, SoftReference<Map<Entry, Object>>> e : readerCache_lru.entrySet()) {
        Map<Entry, Object> v = e.getValue().get();
        if (v != null) {
          rtn.put(e.getKey(), v);
        }
      }
      return rtn.entrySet();
    }
  }

  final SoftLinkMap readerCache = new SoftLinkMap();
  // final Map<Object, Map<Entry, Object>> readerCache = new WeakHashMap<Object, Map<Entry, Object>>();



yannianmu(母延年)
