[
https://issues.apache.org/jira/browse/DROIDS-52?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721119#action_12721119
]
Mingfai Ma commented on DROIDS-52:
----------------------------------
Regarding my previous comment, my focus is on reducing memory usage. By "Long
instead of Date" I mean that we store the value of new Date().getTime() rather
than the Date object itself. In my test this uses less memory. Sorting is not
affected, and I think there is no impact unless our date handling involves
time zones.
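
To illustrate the Long-instead-of-Date idea, here is a minimal sketch. The
VisitRecord class and its field names are hypothetical and are not taken from
the Droids code base; they only show that a primitive long keeps the ordering
and that a Date can be rebuilt when one is needed.

```java
import java.util.Date;

// Hypothetical record type, for illustration only (not a Droids class).
final class VisitRecord implements Comparable<VisitRecord> {
    final String url;
    // Epoch milliseconds, i.e. the value of new Date().getTime().
    final long visitedAtMillis;

    VisitRecord(String url, long visitedAtMillis) {
        this.url = url;
        this.visitedAtMillis = visitedAtMillis;
    }

    // Rebuild a Date only when a caller actually needs one.
    Date visitedAt() {
        return new Date(visitedAtMillis);
    }

    // Ordering by the long value matches ordering by the original Date.
    @Override
    public int compareTo(VisitRecord other) {
        return Long.compare(visitedAtMillis, other.visitedAtMillis);
    }
}
```

Epoch milliseconds carry no time zone, so this only matters if some code
relies on zone-specific handling of the stored dates.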
Re. URI, I have found another issue.
- There are invalid/non-standard URIs on the Internet that modern browsers
still accept. The following are two examples:
http://domainremoved.com/adserv|3.0|327|2062389|0|225|ADTECH;loc=300;grp=[group]
http://domainremoved.com/index.php?command=product:product->details&productId=1889640
They contain characters that are illegal according to java.net.URI, but if we
put the links into Firefox, they do work.
- As you may know, Heritrix does not use URI but UURI (usable URI).
- We should probably give up java.net.URI. I am going to create another issue
for this later.
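
The rejection can be reproduced with a small check. This is only a sketch:
StrictUriCheck and parsesAsJavaNetUri are hypothetical names, and the only
library call used is the java.net.URI(String) constructor, which throws
URISyntaxException for characters such as '|' and '>'.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, for illustration only (not a Droids class).
final class StrictUriCheck {
    // Returns true if java.net.URI accepts the link as written.
    static boolean parsesAsJavaNetUri(String link) {
        try {
            new URI(link);
            return true;
        } catch (URISyntaxException e) {
            // The link contains a character java.net.URI treats as illegal.
            return false;
        }
    }

    public static void main(String[] args) {
        String[] links = {
            "http://domainremoved.com/adserv|3.0|327|2062389|0|225|ADTECH;loc=300;grp=[group]",
            "http://domainremoved.com/index.php?command=product:product->details&productId=1889640"
        };
        for (String link : links) {
            System.out.println(parsesAsJavaNetUri(link) + " " + link);
        }
    }
}
```

Both example links make parsesAsJavaNetUri return false, so a crawler that
builds java.net.URI objects directly would drop them unless it escapes or
repairs the link first.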
> Optimize memory usage of TaskQueue and History
> ----------------------------------------------
>
> Key: DROIDS-52
> URL: https://issues.apache.org/jira/browse/DROIDS-52
> Project: Droids
> Issue Type: Wish
> Components: core
> Affects Versions: 0.01
> Reporter: Mingfai Ma
> Priority: Minor
> Attachments: TaskQueueMemoryTest.java
>
>
> Tasks in TaskQueue and History are items that have to be "persisted" for the
> duration of a single crawl "session"/run. They are not consuming too much
> memory right now, and this issue is created to track some optimization ideas.
> The following are some sample memory usage figures from a 32-bit Windows
> Vista environment (refer to the attached test case):
> - Measured with javamex classmexer, 1M LinkTask objects in a queue consume
> 280M of memory.
> - For History, if each URL is stored as an MD5 String, it takes only 104
> bytes, so 1M URLs take roughly 100M. (reference:
> http://www.javamex.com/tutorials/memory/string_memory_usage.shtml) Notice
> that MD5 is not guaranteed to be unique, but it should be OK for general cases.
> - To reduce the memory footprint further, we may store the MD5 as a byte[],
> which takes exactly 32 bytes and will consume 32M of memory for 1M records.
> Previously, I ran a job in which I tried to reduce the memory usage of
> TaskQueue. I simulated a queue with JBossCache, which supports eviction and
> passivation. JBossCache's passivation mechanism basically serializes the
> items into a database (or other device) and unloads them from memory. It
> could effectively reduce memory usage. For a queue with lots of items, there
> is no need to keep them all in memory, as they won't be processed at the same
> time anyway. If it is necessary to keep a reference, we may passivate the
> LinkTask and just keep a hash (MD5, or even hashCode()).
> Another option is to store the tasks in an embedded database such as
> H2Database, which stores the data on disk.
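
As a sketch of the byte[] idea from the description above: the raw MD5 digest
is 16 bytes, and the 32-byte figure presumably includes the array overhead on
a 32-bit JVM. UrlDigest and md5 are hypothetical names, not Droids classes;
the only library calls used are MessageDigest.getInstance("MD5") and digest().

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, for illustration only (not a Droids class).
final class UrlDigest {
    // Returns the raw 16-byte MD5 digest of the URL instead of a hex String.
    static byte[] md5(String url) {
        try {
            MessageDigest digest = MessageDigest.getInstance("MD5");
            return digest.digest(url.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

Note that byte[] uses identity-based equals() and hashCode(), so a History
built on a HashSet would need a wrapper or a custom lookup structure, and
that wrapper adds back some of the memory saved.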