[ 
https://issues.apache.org/jira/browse/DROIDS-52?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mingfai Ma updated DROIDS-52:
-----------------------------

    Description: 
Tasks in TaskQueue and History are items that have to be "persisted" within a 
single crawl "session"/run. They do not consume too much memory right now; this 
task is created to track some optimization ideas. 

The following are some sample memory usage figures from a 32-bit Windows Vista 
environment (refer to the attached test case):
 - Measured with javamex classmexer, 1M LinkTask objects in a queue consume 
280M of memory. 
 - For history, if the MD5 of each URL is stored as a String, each entry takes 
only 104 bytes, so 1M URLs take roughly 100M. (reference: 
http://www.javamex.com/tutorials/memory/string_memory_usage.shtml) Note that 
MD5 is not guaranteed to be unique, but it should be OK for general cases.
 - To reduce the memory footprint further, we may store the MD5 as a byte[], 
which takes exactly 32 bytes (the 16-byte digest plus array overhead) and 
consumes 32M of memory for 1M records.
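
To illustrate the byte[] idea, here is a minimal sketch in plain Java. It is 
not Droids code: the class name Md5History and the method markVisited are 
hypothetical. One detail worth noting is that a raw byte[] uses identity-based 
equals()/hashCode(), so it cannot be used directly as a HashSet element; the 
sketch wraps each digest in a ByteBuffer, which compares by content.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical history store that remembers URLs by their 16-byte MD5 digest.
final class Md5History {
    // ByteBuffer.wrap(byte[]) gives content-based equals()/hashCode(),
    // which a bare byte[] does not have.
    private final Set<ByteBuffer> seen = new HashSet<>();

    // Returns the raw 16-byte MD5 digest of the URL (not the 32-char hex form).
    static byte[] md5(String url) {
        try {
            return MessageDigest.getInstance("MD5")
                    .digest(url.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required to be present on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Returns true if the URL had not been seen before this call.
    boolean markVisited(String url) {
        return seen.add(ByteBuffer.wrap(md5(url)));
    }

    int size() {
        return seen.size();
    }
}
```

The ByteBuffer wrapper and the HashSet entry add their own per-record overhead 
on top of the 32 bytes for the array, so the real footprint of this sketch 
would have to be measured (for example with classmexer) rather than assumed.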

Previously, I ran a job in which I tried to reduce the memory usage of 
TaskQueue. I simulated a Queue with JBossCache, which supports eviction and 
passivation. JBossCache's passivation mechanism basically serializes the items 
into a database (or other device) and unloads them from memory, which can 
effectively reduce memory usage. For a Queue with lots of items, there is no 
need to keep them all in memory, as they won't be processed at the same time 
anyway. If it is necessary to keep a reference, we may passivate the LinkTask 
and just keep a hash (MD5, or even hashCode()). 
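
The passivation idea can be sketched without JBossCache, using only standard 
Java serialization. The following is an assumption-laden illustration, not the 
JBossCache API and not the job mentioned above: the class SpillingQueue and 
its parameters are hypothetical. It keeps at most maxInMemory tasks on the 
heap and appends the rest, serialized, to a backing file, loading them back in 
FIFO order when they are polled.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Closeable;
import java.io.File;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.RandomAccessFile;
import java.io.Serializable;
import java.util.ArrayDeque;

// Hypothetical FIFO queue that passivates overflow tasks to a file.
final class SpillingQueue<T extends Serializable> implements Closeable {
    private final int maxInMemory;
    private final ArrayDeque<T> inMemory = new ArrayDeque<>();
    private final RandomAccessFile file;
    private long readPos = 0;   // offset of the next passivated task to load
    private long writePos = 0;  // offset where the next task is appended
    private int onDisk = 0;     // number of passivated tasks not yet polled

    SpillingQueue(int maxInMemory, File backing) throws IOException {
        this.maxInMemory = maxInMemory;
        this.file = new RandomAccessFile(backing, "rw");
        this.file.setLength(0); // start from an empty backing file
    }

    void add(T task) throws IOException {
        // Once anything is on disk, later tasks must follow it to keep FIFO order.
        if (onDisk == 0 && inMemory.size() < maxInMemory) {
            inMemory.addLast(task);
            return;
        }
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
            out.writeObject(task);
        }
        byte[] bytes = buf.toByteArray();
        file.seek(writePos);
        file.writeInt(bytes.length); // length prefix, then the serialized task
        file.write(bytes);
        writePos = file.getFilePointer();
        onDisk++;
    }

    // Returns the oldest task, or null if the queue is empty.
    @SuppressWarnings("unchecked")
    T poll() throws IOException, ClassNotFoundException {
        if (!inMemory.isEmpty()) {
            return inMemory.pollFirst();
        }
        if (onDisk == 0) {
            return null;
        }
        file.seek(readPos);
        byte[] bytes = new byte[file.readInt()];
        file.readFully(bytes);
        readPos = file.getFilePointer();
        onDisk--;
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (T) in.readObject();
        }
    }

    int size() {
        return inMemory.size() + onDisk;
    }

    @Override
    public void close() throws IOException {
        file.close();
    }
}
```

The sketch is single-threaded and never reclaims space in the backing file; a 
real implementation would need both, which is part of what a cache such as 
JBossCache or an embedded database provides.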

Another option is to store the tasks in an embedded database such as 
H2Database, which stores the data on disk.

  was:
Tasks in TaskQueue and History are items that has to be persistent in a single 
crawl "session"/run. They are not consuming too much memory right now and this 
task is created for tracking some optimization ideas. 

The following is some sample memory usage figures in a 32-bit Windows Vista 
environment: (refer to the attached test case)
 - With javamex classmexer, 1M LinkTask in a queue consumes 280M of memory. 
 - For history, stores as MD5 as String, each URL could take 104 bytes only. 1M 
URL takes 100M roughly. (reference: 
http://www.javamex.com/tutorials/memory/string_memory_usage.shtml) Notice that 
MD5 is not guaranteed to be unique but it should be ok for general cases.
 - To reduce memory footprint future, we may store MD5 as byte[], that take 
exactly 32 bytes, and will consumes 32M memory for 1M records

Previously, I ran a job that I try to reduce the memory usage for TaskQueue, I 
tried to simulate a Queue function with JBossCache that support eviction and 
passivation. JBossCache's passivation mechanism basically serialize the item 
into a database (or other device) and unload them from memory. It could 
effectively reduce memory usage. For a Queue with lots of items, there is no 
need to keep them all in memory as they won't be processed at the same time 
anyway. If it is necessary to keep a reduce, we may passivate the LinkTask and 
just keep a hash (MD5, or even hashCode()). 







> Optimize memory usage of TaskQueue and History
> ----------------------------------------------
>
>                 Key: DROIDS-52
>                 URL: https://issues.apache.org/jira/browse/DROIDS-52
>             Project: Droids
>          Issue Type: Wish
>          Components: core
>    Affects Versions: 0.01
>            Reporter: Mingfai Ma
>            Priority: Minor
>
> Tasks in TaskQueue and History are items that have to be "persisted" within a 
> single crawl "session"/run. They do not consume too much memory right now; 
> this task is created to track some optimization ideas. 
> The following are some sample memory usage figures from a 32-bit Windows 
> Vista environment (refer to the attached test case):
>  - Measured with javamex classmexer, 1M LinkTask objects in a queue consume 
> 280M of memory. 
>  - For history, if the MD5 of each URL is stored as a String, each entry 
> takes only 104 bytes, so 1M URLs take roughly 100M. (reference: 
> http://www.javamex.com/tutorials/memory/string_memory_usage.shtml) Note 
> that MD5 is not guaranteed to be unique, but it should be OK for general 
> cases.
>  - To reduce the memory footprint further, we may store the MD5 as a byte[], 
> which takes exactly 32 bytes (the 16-byte digest plus array overhead) and 
> consumes 32M of memory for 1M records.
> Previously, I ran a job in which I tried to reduce the memory usage of 
> TaskQueue. I simulated a Queue with JBossCache, which supports eviction and 
> passivation. JBossCache's passivation mechanism basically serializes the 
> items into a database (or other device) and unloads them from memory, which 
> can effectively reduce memory usage. For a Queue with lots of items, there is 
> no need to keep them all in memory, as they won't be processed at the same 
> time anyway. If it is necessary to keep a reference, we may passivate the 
> LinkTask and just keep a hash (MD5, or even hashCode()). 
> Another option is to store the tasks in an embedded database such as 
> H2Database, which stores the data on disk.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
