RE: Our Optimize Suggestions on lucene 3.5
Sorry Robert – you're right, I was under the impression that we had already changed that. In fact, the WeakHashMap is needed because multiple readers (especially slow ones) can share the same uninverted fields. In an ideal world we would rework this completely: remove FieldCacheImpl altogether and keep the field maps directly on the UninvertingReader as regular member fields. The only problem with this: if you have multiple UninvertingReaders, each of them gets its own separate uninverted instances. But doing that is already a bug anyway.

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

From: Robert Muir [mailto:rcm...@gmail.com]
Sent: Friday, January 30, 2015 5:04 AM
To: dev@lucene.apache.org
Subject: Re: Our Optimize Suggestions on lucene 3.5

I am not sure this is the case. Actually, FieldCacheImpl still works as before and still has a weak hashmap. However, I think the weak map is unnecessary: reader close listeners already ensure purging from the map, so I don't think the weak map serves any purpose today. The only possible advantage it has is to allow you to GC field caches when you are already leaking readers... it could just be a regular map IMO.

On Thu, Jan 29, 2015 at 9:35 AM, Uwe Schindler u...@thetaphi.de wrote:
Hi, parts of your suggestions are already done in Lucene 4+. […]
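A minimal sketch of the design Uwe describes – the uninverted data held as a plain member field, so its lifetime is simply the reader's lifetime. All names here are hypothetical illustrations, not the actual Lucene implementation:

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical sketch: uninverted field values stored as a regular member,
    // so they are garbage-collected together with the reader itself - no global
    // cache and no weak map needed.
    class MemberFieldUninvertingReader /* would extend FilterLeafReader in real code */ {
        // field name -> uninverted per-document values; lives and dies with this reader
        private final Map<String, long[]> uninverted = new HashMap<>();

        synchronized long[] getNumericValues(String field) {
            // uninvert lazily on first access, then reuse the same array
            return uninverted.computeIfAbsent(field, this::uninvertField);
        }

        private long[] uninvertField(String field) {
            // walk the inverted index of 'field' and fill a per-document array
            return new long[0]; // placeholder
        }
    }

The drawback Uwe points out is visible here: two such readers over the same segment would each hold their own copy of the map, which is why the shared WeakHashMap is still needed as long as multiple readers can share uninverted fields.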
Re: Our Optimize Suggestions on lucene 3.5
I think this is all fine, because things are keyed on the core reader and there are already core listeners installed to purge entries when the ref count for a core drops to zero. Honestly, if you change the map to a regular one, all tests pass.

On Fri, Jan 30, 2015 at 5:37 AM, Uwe Schindler u...@thetaphi.de wrote:
Sorry Robert – you're right, I had the impression that we changed that already. […]
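A minimal sketch of the pattern Robert describes: a plain (non-weak) map keyed per segment core, purged by a close listener. This uses the CacheHelper naming of later Lucene versions (Lucene 5 itself used addCoreClosedListener), and the cache contents are placeholders:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.LeafReader;

    class CoreKeyedCache {
        // regular (non-weak) outer map: entries are purged explicitly when the
        // segment core closes, so weak references are not needed for correctness
        private final Map<IndexReader.CacheKey, Map<String, long[]>> cache =
            new ConcurrentHashMap<>();

        Map<String, long[]> perCore(LeafReader reader) {
            IndexReader.CacheHelper helper = reader.getCoreCacheHelper();
            return cache.computeIfAbsent(helper.getKey(), key -> {
                // remove this core's entry once its ref count drops to zero
                helper.addClosedListener(cache::remove);
                return new ConcurrentHashMap<>();
            });
        }
    }

A caller would then do perCore(reader).computeIfAbsent(field, ...) to uninvert lazily; when the last reader sharing the core closes, the listener fires and the whole per-core map disappears – exactly the purge that makes the weak map redundant.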
RE: Re: Our Optimize Suggestions on lucene 3.5
Hi yannianmu,

What you propose here is a so-called "soft reference cache". That's something completely different. In fact, FieldCache was never a "cache", because it was never able to evict entries on memory pressure. The weak map has a different purpose (please note: it uses "weak keys", not "weak values" as your implementation does). Weak maps are not useful for caches at all; they are useful to decouple object instances from each other. The weak map in FieldCacheImpl is there to prevent memory leaks when you open multiple IndexReaders on the same segments: once the last one is closed and garbage collected, the corresponding uninverted field should disappear. And this works correctly. But it does not mean that uninverted fields are removed on memory pressure: once loaded, the uninverted data stays alive until all referring readers are closed – that is the idea behind the design, so there is no memory leak! If you want a cache that discards its entries on memory pressure, implement your own field "cache" (a real cache, like you did).

Uwe

P.S.: FieldCache was a bad name, because it was no "cache". This is why it is exposed as "UninvertingReader" now.

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

From: yannianmu(母延年) [mailto:yannia...@tencent.com]
Sent: Friday, January 30, 2015 5:47 AM
To: Robert Muir; dev
Subject: Re: Re: Our Optimize Suggestions on lucene 3.5

WeakHashMap may cause a memory leak problem, so we use SoftReference instead of it, like this: […] (the SoftLinkMap code is quoted in full in yannianmu's message below)

From: Robert Muir [mailto:rcm...@gmail.com]
Date: 2015-01-30 12:03
To: dev@lucene.apache.org
Subject: Re: Our Optimize Suggestions on lucene 3.5

I am not sure this is the case. […]
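To illustrate Uwe's distinction with plain JDK code – a WeakHashMap holds its keys weakly, so an entry disappears once nothing else references the key (e.g. a closed reader), while a map with SoftReference values keeps entries until the JVM needs memory, which is real cache behavior:

    import java.lang.ref.SoftReference;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.WeakHashMap;

    public class WeakVsSoft {
        public static void main(String[] args) {
            // Weak KEYS: entry lifetime is tied to the key object (FieldCacheImpl style).
            Map<Object, String> weakKeyed = new WeakHashMap<>();
            Object readerKey = new Object();
            weakKeyed.put(readerKey, "uninverted field data");
            readerKey = null;  // drop the last strong reference to the key
            System.gc();       // the entry is now eligible for removal

            // Soft VALUES: entry survives GC until memory pressure (a real cache).
            Map<String, SoftReference<String>> softValued = new HashMap<>();
            softValued.put("field", new SoftReference<>("uninverted field data"));
            SoftReference<String> ref = softValued.get("field");
            String value = (ref != null) ? ref.get() : null; // may be null after pressure
            System.out.println(value);
        }
    }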
RE: Our Optimize Suggestions on lucene 3.5
Hi,

parts of your suggestions are already done in Lucene 4+. For one part I can tell you:

weakhashmap, hashmap, synchronized problems:
1. FieldCacheImpl uses a WeakHashMap to manage the field value cache; it has a memory-leak bug.
2. SolrInputDocument uses a lot of HashMap/LinkedHashMap instances for fields, which wastes a lot of memory.
3. AttributeSource uses a WeakHashMap to cache attribute implementation classes, guarded by a global synchronized block, which reduces performance.
4. AttributeSource is a base class; NumericField extends AttributeSource, and both create a lot of HashMaps that NumericField never uses.
5. Because of all this, the JVM GC carries a lot of burden for never-used HashMaps.

All Lucene items no longer apply:
1. FieldCache is gone and is no longer supported in Lucene 5. You should use the new DocValues index format for that (column-based storage, optimized for sorting and numerics). You can still use Lucene's UninvertingReader, but that one has no weak maps anymore because it is no cache.
2. No idea about that one – it's unrelated to Lucene.
3. AttributeSource no longer does this: since Lucene 4.8 it uses Java 7's java.lang.ClassValue to attach the implementation class to the interface. No concurrency problems anymore. It also uses MethodHandles to invoke the attribute classes.
4. NumericField no longer exists, and the base class does not use AttributeSource. All field instances now automatically reuse the inner TokenStream instances across fields, too!
5. See above.

In addition, Lucene has much better memory use, because terms are no longer UTF-16 strings and live in large shared byte arrays instead. So a lot of those other "optimizations" are handled in a different way in Lucene 4 and Lucene 5 (coming out in the next few days).

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

From: yannianmu(母延年) [mailto:yannia...@tencent.com]
Sent: Thursday, January 29, 2015 12:59 PM
To: general; dev; commits
Subject: Our Optimize Suggestions on lucene 3.5

Dear Lucene dev,

We are from the Hermes team. Hermes is a project based on Lucene 3.5 and Solr 3.5. Hermes processes 100 billion documents per day, 2000 billion documents in total (two months of data). Nowadays our single-cluster index size is over 200 TB; the total size is 600 TB. We use Lucene to speed up our big-data warehouse and reduce analysis response times, for example for filters like age=32 and keywords like 'lucene', or for operations like count, sum, order by, and group by. Hermes can filter data out of 1000 billion rows in 1 second; an order by over 10 billion rows takes 10 s, a group by over 10 billion rows takes 15 s, and sum/avg/max/min statistics over 10 billion rows take 30 s.

For those purposes we made lots of improvements on top of Lucene and Solr. Lucene has changed so much since version 4.10 that we don't want to commit our code to Lucene; we only want to introduce our improvements based on Lucene 3.5 and explain how Hermes can process 100 billion documents per day on 32 physical machines. We think it may be helpful for people with similar needs.

First level index (tii): loading on demand

Original:
1. The .tii file is loaded into RAM by TermInfosReaderIndex.
2. That can be quite slow when an index is first opened.
3. The index has to be opened persistently; once opened, it is never closed.
4. This limits the number of indexes: with thousands of indexes it becomes impossible.

Our improvement:
1. Loading on demand: not all fields need to be loaded into memory.
2. We changed the getIndexOffset method (a binary search) to work on disk instead of in memory, using an LRU cache to speed it up (a sketch of that caching idea follows after this message).
3. getIndexOffset on disk saves lots of memory and reduces the time needed to open an index.
4. Hermes often opens different indexes for different businesses; when an index has not been used for a while, we close it (managed by LRU).
5. This way one physical machine can hold a very large number of indexes.

Solve the problem:
1. Hermes needs to store over 1000 billion documents; we do not have enough memory to hold all the tii files.
2. We have a very large number of indexes; if all of them were open, that would waste lots of file descriptors, which the file system will not allow.

Build index on HDFS:
1. We modified the Lucene 3.5 code in 2013 so that we can build indexes directly on HDFS (Lucene has supported HDFS since 4.0).
2. All the offline data is built by MapReduce on HDFS.
3. We moved all the realtime indexes from local disk to HDFS.
4. We can ignore disk failures because the index is on HDFS.
5. We can move a process from one machine to another on HDFS.
6. We can quickly recover an index when a disk failure happens.
7. We do not need to recover data when a machine breaks (the index is so big that moving it would take many hours); the process can quickly move to another machine, coordinated by ZooKeeper heartbeats.
8. Everyone knows an index on HDFS is slower than on the local file system – but why? On the local file system the OS make so…
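For illustration, a minimal plain-JDK sketch of the LRU idea in "Our improvement" item 2 – keep only the most recently used term-index offsets in memory, backed by a LinkedHashMap in access order, and fall back to the on-disk binary search on a miss. All names are hypothetical:

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Hypothetical LRU cache for term-index offsets: entries beyond MAX_ENTRIES
    // are evicted in least-recently-used order, so the full .tii data can stay
    // on disk and only hot offsets live in memory.
    class IndexOffsetLruCache extends LinkedHashMap<String, Long> {
        private static final int MAX_ENTRIES = 1024;

        IndexOffsetLruCache() {
            super(MAX_ENTRIES, 0.75f, true); // accessOrder=true -> LRU semantics
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<String, Long> eldest) {
            return size() > MAX_ENTRIES;
        }

        long getIndexOffset(String term) {
            Long cached = get(term);            // also marks the entry as recently used
            if (cached == null) {
                cached = binarySearchOnDisk(term);
                put(term, cached);              // may evict the least-recently-used entry
            }
            return cached;
        }

        private long binarySearchOnDisk(String term) {
            return 0L; // placeholder for the on-disk dichotomy described above
        }
    }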
Re: Our Optimize Suggestions on lucene 3.5
I am not sure this is the case. Actually, FieldCacheImpl still works as before and still has a weak hashmap. However, I think the weak map is unnecessary: reader close listeners already ensure purging from the map, so I don't think the weak map serves any purpose today. The only possible advantage it has is to allow you to GC field caches when you are already leaking readers... it could just be a regular map IMO.

On Thu, Jan 29, 2015 at 9:35 AM, Uwe Schindler u...@thetaphi.de wrote:
Hi, parts of your suggestions are already done in Lucene 4+. […] FieldCache is gone and is no longer supported in Lucene 5. You should use the new DocValues index format for that (column-based storage, optimized for sorting and numerics). […]
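As a minimal sketch of the DocValues point Uwe makes in the quoted message: index a numeric value as a doc-values field and sort on it with no field cache and no uninversion. This uses the API of recent Lucene versions (ByteBuffersDirectory exists since Lucene 8); field names are made up for the example:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.NumericDocValuesField;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Sort;
    import org.apache.lucene.search.SortField;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.ByteBuffersDirectory;
    import org.apache.lucene.store.Directory;

    public class DocValuesSortDemo {
        public static void main(String[] args) throws Exception {
            Directory dir = new ByteBuffersDirectory();
            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
                Document doc = new Document();
                doc.add(new StringField("keywords", "lucene", Field.Store.YES));
                doc.add(new NumericDocValuesField("age", 32)); // column-stride storage, written at index time
                writer.addDocument(doc);
            }
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                // sort on the doc-values field: no FieldCache, no uninversion at search time
                Sort sort = new Sort(new SortField("age", SortField.Type.LONG));
                TopDocs hits = searcher.search(new TermQuery(new Term("keywords", "lucene")), 10, sort);
                System.out.println(hits.scoreDocs.length);
            }
        }
    }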
Re: Re: Our Optimize Suggestions on lucene 3.5
WeakHashMap may cause a memory leak problem. We use SoftReference instead of it, like this:

public static class SoftLinkMap {
    private static int SORT_CACHE_SIZE = 1024;
    private static float LOADFACTOR = 0.75f;

    final Map<Object, SoftReference<Map<Entry, Object>>> readerCache_lru =
        new LinkedHashMap<Object, SoftReference<Map<Entry, Object>>>(
            (int) Math.ceil(SORT_CACHE_SIZE / LOADFACTOR) + 1, LOADFACTOR, true) {
        @Override
        protected boolean removeEldestEntry(Map.Entry<Object, SoftReference<Map<Entry, Object>>> eldest) {
            return size() > SORT_CACHE_SIZE;
        }
    };

    public void remove(Object key) {
        readerCache_lru.remove(key);
    }

    public Map<Entry, Object> get(Object key) {
        SoftReference<Map<Entry, Object>> w = readerCache_lru.get(key);
        if (w == null) {
            return null;
        }
        return w.get();
    }

    public void put(Object key, Map<Entry, Object> value) {
        readerCache_lru.put(key, new SoftReference<Map<Entry, Object>>(value));
    }

    public Set<java.util.Map.Entry<Object, Map<Entry, Object>>> entrySet() {
        HashMap<Object, Map<Entry, Object>> rtn = new HashMap<Object, Map<Entry, Object>>();
        for (java.util.Map.Entry<Object, SoftReference<Map<Entry, Object>>> e : readerCache_lru.entrySet()) {
            Map<Entry, Object> v = e.getValue().get();
            if (v != null) {
                rtn.put(e.getKey(), v);
            }
        }
        return rtn.entrySet();
    }
}

final SoftLinkMap readerCache = new SoftLinkMap();
//final Map<Object, Map<Entry, Object>> readerCache = new WeakHashMap<Object, Map<Entry, Object>>();

yannianmu(母延年)
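For context, a minimal hypothetical usage of the SoftLinkMap above (Entry stands for FieldCacheImpl's cache-entry key, as in the original code): unlike the WeakHashMap it replaces, a lookup can come back null not only for an unknown key but also after LRU eviction or a GC under memory pressure, so the caller must always be ready to rebuild:

    // Hypothetical caller, assuming the SoftLinkMap instance 'readerCache' above.
    Map<Entry, Object> innerCacheFor(Object readerKey) {
        Map<Entry, Object> inner = readerCache.get(readerKey);
        if (inner == null) {
            // never cached, evicted by the LRU bound (SORT_CACHE_SIZE),
            // or cleared by the GC under memory pressure - recreate it
            inner = new HashMap<Entry, Object>();
            readerCache.put(readerKey, inner); // wrapped in a SoftReference internally
        }
        return inner;
    }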
From: Robert Muir [mailto:rcm...@gmail.com]
Date: 2015-01-30 12:03
To: dev@lucene.apache.org
Subject: Re: Our Optimize Suggestions on lucene 3.5

I am not sure this is the case. Actually, FieldCacheImpl still works as before and still has a weak hashmap. […]

On Thu, Jan 29, 2015 at 9:35 AM, Uwe Schindler u...@thetaphi.de wrote:
Hi, parts of your suggestions are already done in Lucene 4+. […]