[
https://issues.apache.org/jira/browse/DROIDS-54?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mingfai Ma updated DROIDS-54:
-----------------------------
Attachment: SampleLink.java
Attached is a sample implementation for review:
- We can still make a LinkTask extend this base Link class, or just add more
methods to this class (and optionally rename it to LinkTask).
- It stores the URL as a String, but the constructor always calls new URI() to
ensure the URL string is valid at construction time.
- Methods such as toString, equals and hashCode may be deleted in the final
implementation, or changed to follow this project's standards.
- A few convenience methods are added, such as getHost(), getURI() and
resolve(String). resolve is added to mirror the resolve method that URI
already has. Using a LinkResolver with the same base URI could be slightly
more efficient.
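For readers without the attachment, the points above can be illustrated with a minimal sketch. This is not the attached SampleLink.java; the method bodies, the getUrl() accessor, and the choice to rethrow URISyntaxException as an unchecked exception are assumptions made for illustration only.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical sketch of a base Link class; the real attachment may differ.
class SampleLink {
    // The URL is kept as the original String supplied by the caller.
    private final String url;

    SampleLink(String url) {
        try {
            // Parse only to validate; the parsed URI is not stored.
            new URI(url);
        } catch (URISyntaxException e) {
            // Assumption: surface invalid input as an unchecked exception.
            throw new IllegalArgumentException("Invalid URL: " + url, e);
        }
        this.url = url;
    }

    String getUrl() {
        return url;
    }

    // Safe to use URI.create here because the string was validated above.
    URI getURI() {
        return URI.create(url);
    }

    String getHost() {
        return getURI().getHost();
    }

    // Resolves a (possibly relative) reference against this link,
    // in the same way java.net.URI.resolve(String) does.
    SampleLink resolve(String reference) {
        return new SampleLink(getURI().resolve(reference).toString());
    }
}
```

Because getURI() re-parses the string on every call, a resolver that holds one parsed base URI would avoid repeated parsing, which is the efficiency point made about LinkResolver.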
For my part, I am using a crawler derived from Droids, and I declare every
usage of Link as <T extends Link>, e.g. LinkQueue<T extends Link> extends
PriorityBlockingQueue<T>. This could also be considered.
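A rough sketch of that generic declaration follows. The Link interface shown here, its getDepth() method, the DepthLink class, and the depth-based ordering are all hypothetical; the actual Droids interfaces and any real ordering policy may differ.

```java
import java.util.Comparator;
import java.util.concurrent.PriorityBlockingQueue;

// Hypothetical minimal Link contract, for illustration only.
interface Link {
    String getUrl();
    int getDepth();
}

// A queue that is generic over the concrete Link type, so callers get
// their own subtype back from poll()/take() without casting.
class LinkQueue<T extends Link> extends PriorityBlockingQueue<T> {
    LinkQueue() {
        // Assumption: shallower links are fetched first.
        super(11, Comparator.comparingInt(Link::getDepth));
    }
}

// Hypothetical concrete link used to exercise the queue.
class DepthLink implements Link {
    private final String url;
    private final int depth;

    DepthLink(String url, int depth) {
        this.url = url;
        this.depth = depth;
    }

    @Override
    public String getUrl() {
        return url;
    }

    @Override
    public int getDepth() {
        return depth;
    }
}
```

With this shape, a LinkQueue<DepthLink> returns DepthLink instances directly, which is the main benefit of bounding the type parameter instead of using Link everywhere.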
> Make LinkTask supports arbitrary data by extends HashMap, and consider to
> refactor Task, Link, and LinkTask
> -----------------------------------------------------------------------------------------------------------
>
> Key: DROIDS-54
> URL: https://issues.apache.org/jira/browse/DROIDS-54
> Project: Droids
> Issue Type: New Feature
> Components: core
> Affects Versions: 0.01
> Reporter: Mingfai Ma
> Attachments: SampleLink.java
>
>
> refer to the initial idea at:
> https://issues.apache.org/jira/browse/DROIDS-48?focusedCommentId=12721121&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12721121
> The current implementation of LinkTask:
> {code}
> public class LinkTask implements Link, Serializable
> {
> private Date started;
> private final int depth;
> private final URI uri;
> private final Link from;
>
> private Date lastModifedDate;
> private Collection<URI> linksTo;
> private String anchorText;
> private int weight;
> {code}
> Suggested change:
> {code}
> public class LinkTask extends HashMap<String, Serializable>
> or
> public class LinkTask extends HashMap<String, Serializable> implements Link
> {code}
> The minimum required attributes are:
> - final ? id
> - mainly to have a minimum-size value to use as a hash key and to store in
> a memory/data grid for lookup, e.g. as a history to avoid duplicate fetching.
> Refer to DROIDS-53.
> - final String url
> - the original String representation of the URL (preferred), or a
> java.net.URI representation with the encoded string (which does not seem
> good).
> - the url is the original one provided by the user at construction. Two
> different URLs may refer to the same resource, e.g. http://www.apache.org and
> http://www.apache.org/; it is up to the user to decide whether they should be
> normalized (they could use the URL/LinkNormalizer in DROIDS-45).
> The other fields are basically optional:
> - started/taskDate: if the queue uses it for sorting, it is useful;
> otherwise, it is just for logging.
> - "weight" is another example of a field that not every implementation needs.
> - "linksTo", a.k.a. outLinks, is also optional on the LinkTask. An
> implementation may extract the outlinks and put them in the queue directly
> without storing them in the LinkTask.
> - "from", a.k.a. referrer, should not store the Link reference, as that
> keeps the referring Link reachable and so affects GC.
> By the way, should we also simplify Link, Task and LinkTask? If we use a Map,
> it is already very generic. Link and Task could be different concepts if we
> need to use them separately.
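The map-based LinkTask proposed in the quoted description could look roughly like the sketch below. The class name MapLinkTask, the String type chosen for id (the description leaves it open as "?"), and the key names "weight" and "from" are assumptions for illustration, not part of any Droids API.

```java
import java.io.Serializable;
import java.util.HashMap;

// Hypothetical sketch: required attributes are final fields,
// everything optional lives in the inherited map.
class MapLinkTask extends HashMap<String, Serializable> {
    private static final long serialVersionUID = 1L;

    // Assumption: a String id; the proposal does not fix the type.
    private final String id;
    // The URL exactly as supplied by the caller, not normalized.
    private final String url;

    MapLinkTask(String id, String url) {
        this.id = id;
        this.url = url;
    }

    String getId() {
        return id;
    }

    String getUrl() {
        return url;
    }
}
```

Optional data is then attached per implementation, for example task.put("weight", 5), and the referrer can be stored as a plain URL string under a key such as "from" instead of as a Link reference. One caveat of extending HashMap is that equals and hashCode are inherited from the map and compare only the entries, not id or url, so two tasks with different URLs and identical optional data would compare equal unless this is handled separately.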