[ 
https://issues.apache.org/jira/browse/LUCENE-5736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gergő Törcsvári updated LUCENE-5736:
------------------------------------

    Attachment: CachingNaiveBayesClassifier.java

The attached class is a working copy!

This is a caching version of the SimpleNaiveBayes classifier. The cache is a 
hash map: when a word is needed, we search the index for it in every class and 
store the results in the map. The next time, we read the values from the cache 
instead of searching the index again.
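
A minimal sketch of the lookup described above, in plain Java. It is not the 
attached class: the name TermClassCountCache, the indexLookup function (a 
stand-in for the real index search) and the indexSearches counter are all 
assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToIntBiFunction;

/** Hypothetical sketch of a per-term cache; not the attached classifier. */
class TermClassCountCache {
    private final List<String> classes;
    // Assumed stand-in for the index search: (term, class) -> occurrences.
    private final ToIntBiFunction<String, String> indexLookup;
    // term -> (class -> occurrences of the term in that class)
    private final Map<String, Map<String, Integer>> cache = new HashMap<>();
    // Counts how many times the (assumed) index search was invoked.
    int indexSearches = 0;

    TermClassCountCache(List<String> classes,
                        ToIntBiFunction<String, String> indexLookup) {
        this.classes = classes;
        this.indexLookup = indexLookup;
    }

    /** Returns the per-class counts for a term, searching only on a miss. */
    Map<String, Integer> countsFor(String term) {
        Map<String, Integer> counts = cache.get(term);
        if (counts == null) {
            // Cache miss: look the term up once for every class.
            counts = new HashMap<>();
            for (String c : classes) {
                counts.put(c, indexLookup.applyAsInt(term, c));
                indexSearches++;
            }
            cache.put(term, counts);
        }
        return counts;
    }
}
```

With two classes, the first call to countsFor for a term performs two lookups, 
and repeated calls for the same term perform none.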

(Re)initializing the cache recalculates docsWithClassSize, clears the hash 
maps, and prepares new ones. Two maps and a list are needed: the first map 
contains the term-classes-termInClassOccurrence entries (this is the cache), 
the list contains the class names, and the second map contains the 
class-avgUniqueTermNumber entries. The last two are fully preloaded; the first 
is built dynamically during searches.
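
The three structures and the reset step could be laid out roughly as follows. 
This is only a sketch under assumptions: the class name CacheState, the field 
names, and the idea of passing the precomputed values into reInit are 
hypothetical, and the real class computes them from the index.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical holder for the cache structures; not the attached class. */
class CacheState {
    // term -> (class -> termInClassOccurrence); filled lazily during searches.
    final Map<String, Map<String, Integer>> termClassOccurrences = new HashMap<>();
    // Class names; fully preloaded on (re)initialization.
    final List<String> classNames = new ArrayList<>();
    // class -> avgUniqueTermNumber; fully preloaded on (re)initialization.
    final Map<String, Double> classAvgUniqueTerms = new HashMap<>();
    int docsWithClassSize;

    /** Clears everything and preloads the list and the second map. */
    void reInit(int docsWithClass, Map<String, Double> avgUniqueTermsByClass) {
        docsWithClassSize = docsWithClass;
        termClassOccurrences.clear();
        classNames.clear();
        classAvgUniqueTerms.clear();
        classNames.addAll(avgUniqueTermsByClass.keySet());
        classAvgUniqueTerms.putAll(avgUniqueTermsByClass);
    }
}
```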

With many terms and/or classes this needs a lot of memory, so there is a 
built-in option for limiting the cache size. Terms that are really rare are 
expected to appear rarely in other documents too, so they are left out of the 
cache. There is also an option to leave them out of the classification 
calculation entirely.
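
One way to express the two options is a small policy object. The sketch below 
is an assumption, not the attached code: RareTermPolicy, minTermOccurrence and 
skipRareTerms are hypothetical names, and treating "rare" as a total occurrence 
count below a fixed threshold is only one possible definition.

```java
/** Hypothetical rare-term policy; the threshold semantics are an assumption. */
class RareTermPolicy {
    private final int minTermOccurrence;  // terms below this count are "rare"
    private final boolean skipRareTerms;  // also drop rare terms from scoring

    RareTermPolicy(int minTermOccurrence, boolean skipRareTerms) {
        this.minTermOccurrence = minTermOccurrence;
        this.skipRareTerms = skipRareTerms;
    }

    /** Rare terms are never stored in the cache. */
    boolean shouldCache(int totalOccurrences) {
        return totalOccurrences >= minTermOccurrence;
    }

    /** Rare terms take part in scoring unless skipRareTerms is set. */
    boolean shouldScore(int totalOccurrences) {
        return !skipRareTerms || totalOccurrences >= minTermOccurrence;
    }
}
```

With a threshold of 5, a term seen twice is not cached; whether it still 
contributes to the classification depends on skipRareTerms.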

> Separate the classifiers to online and caching where possible
> -------------------------------------------------------------
>
>                 Key: LUCENE-5736
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5736
>             Project: Lucene - Core
>          Issue Type: Sub-task
>          Components: modules/classification
>            Reporter: Gergő Törcsvári
>         Attachments: CachingNaiveBayesClassifier.java
>
>
> The Lucene classifier implementations are now near-online if they get a near 
> real-time reader. This is good for users who have a continuously changing 
> dataset, but slow for datasets that do not change.
> The idea is: what if we implement a cache and speed up the results where 
> possible?



--
This message was sent by Atlassian JIRA
(v6.2#6252)

