[ 
https://issues.apache.org/jira/browse/DROIDS-52?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12713895#action_12713895
 ] 

Mingfai Ma commented on DROIDS-52:
----------------------------------

One more set of figures, for the URL "http://www.apache.org/12345678":
 - as a URI, it is 224 bytes
 - as a String, it is 96 bytes
 - as a byte[], it is 48 bytes

Just as an example, suppose the link task stored the URI as a byte[].

Storing data as byte[] is quite extreme, though. In crawling, we probably need 
to consider whether CPU, memory or bandwidth is more costly.
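
To make the trade-off concrete, here is a minimal sketch of the byte[] idea. CompactUrl is a hypothetical holder class made up for illustration, not an existing Droids class, and UTF-8 is an assumed encoding. It keeps only the raw bytes in memory and pays the CPU cost of rebuilding the URI each time one is needed.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;

/** Hypothetical holder (not part of Droids): stores a URL as raw bytes. */
public class CompactUrl {
    private final byte[] bytes;

    public CompactUrl(String url) {
        // Assumption: UTF-8; crawled URLs are normally plain ASCII anyway.
        this.bytes = url.getBytes(StandardCharsets.UTF_8);
    }

    /** Number of payload bytes held (excludes the array's object overhead). */
    public int length() {
        return bytes.length;
    }

    /**
     * Rebuilds the URI on demand. URI.create throws the unchecked
     * IllegalArgumentException if the stored text is not a valid URI.
     */
    public URI toUri() {
        return URI.create(new String(bytes, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        CompactUrl compact = new CompactUrl("http://www.apache.org/12345678");
        System.out.println(compact.length() + " payload bytes, rebuilt as " + compact.toUri());
    }
}
```

Whether this is worth doing depends on how often each task's URI is actually needed after it is queued.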

In Droids, the use of URI, String, and Link is not very standardized:
 - URLFilter: String filter(String urlString); 
 - parser: Parse parse(ContentEntity entity, Link link) throws DroidsException, IOException;
 - handler: void handle(URI uri, ContentEntity entity)
 - LinkTask: public LinkTask( Link from, URI uri, int depth )
 
On a modern CPU, constructing a URI is cheap. In a quick test on my PC, the 
following piece of code takes 5s to run (one million constructions, so about 
5 microseconds each):
{code}
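        // Note: new URI(String) throws the checked URISyntaxException, so the
        // enclosing method must declare or catch it.
        int max = 1000000;
        String url = "http://www.apache.org/";
        byte[] bytes = url.getBytes();

        // Time one million String decodes plus URI constructions.
        long beginTime = System.currentTimeMillis();
        for (int i = 0; i < max; i++) {
            new URI(new String(bytes));
        }
        System.out.println("elapsed time: " + (System.currentTimeMillis() - beginTime) + "ms");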
{code}

My initial thought is that we should standardize the interfaces on either URI 
or String. 
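
If URI were the choice, existing String-based code would not need to be rewritten at once. The following is a rough sketch under that assumption. UriFilter and adapt are hypothetical names invented here, not Droids API, and the String-based filter is modelled as a plain function with the same shape as filter(String urlString), returning null to reject a URL.

```java
import java.net.URI;
import java.util.function.UnaryOperator;

/** Hypothetical sketch (not Droids API): bridging a String-based filter to URI. */
public class UriFilterAdapter {

    /** Assumed URI-based contract: returns the URI to follow, or null to reject. */
    public interface UriFilter {
        URI filter(URI uri);
    }

    /** Wraps a String-in/String-out filter so callers can work with URI only. */
    public static UriFilter adapt(UnaryOperator<String> stringFilter) {
        return uri -> {
            String result = stringFilter.apply(uri.toString());
            // Preserve the "null means rejected" convention assumed above.
            return result == null ? null : URI.create(result);
        };
    }

    public static void main(String[] args) {
        // Example String-based rule: accept only http URLs.
        UriFilter httpOnly = adapt(s -> s.startsWith("http://") ? s : null);
        System.out.println(httpOnly.filter(URI.create("http://www.apache.org/")));
        System.out.println(httpOnly.filter(URI.create("ftp://example.org/")));
    }
}
```

The adapter converts between the two types on every call, so it is only a migration aid. The conversion cost is the same one measured in the test above.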

> Optimize memory usage of TaskQueue and History
> ----------------------------------------------
>
>                 Key: DROIDS-52
>                 URL: https://issues.apache.org/jira/browse/DROIDS-52
>             Project: Droids
>          Issue Type: Wish
>          Components: core
>    Affects Versions: 0.01
>            Reporter: Mingfai Ma
>            Priority: Minor
>         Attachments: TaskQueueMemoryTest.java
>
>
> Tasks in TaskQueue and History are items that have to be "persisted" for a 
> single crawl "session"/run. They do not consume too much memory right now, 
> and this task is created to track some optimization ideas. 
> The following are some sample memory usage figures in a 32-bit Windows Vista 
> environment (refer to the attached test case):
>  - Measured with javamex classmexer, 1M LinkTask objects in a queue consume 
> 280M of memory. 
>  - For history, if the MD5 is stored as a String, each URL takes only 104 
> bytes, so 1M URLs take roughly 100M. (reference: 
> http://www.javamex.com/tutorials/memory/string_memory_usage.shtml) Note 
> that MD5 is not guaranteed to be unique, but it should be OK for general cases.
>  - To reduce the memory footprint further, we may store the MD5 as byte[], 
> which takes exactly 32 bytes and will consume 32M of memory for 1M records.
> Previously, I ran a job in which I tried to reduce the memory usage of 
> TaskQueue. I tried to simulate a Queue with JBossCache, which supports eviction 
> and passivation. JBossCache's passivation mechanism basically serializes the 
> item into a database (or other device) and unloads it from memory. It could 
> effectively reduce memory usage. For a Queue with lots of items, there is no 
> need to keep them all in memory as they won't be processed at the same time 
> anyway. If it is necessary to keep a reference, we may passivate the LinkTask 
> and just keep a hash (MD5, or even hashCode()). 
> Another option is to store the tasks in an embedded database such as 
> H2Database, which stores the data on disk.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
