This issue presents tips, techniques, and sample code for the following
topics: extracting links from an HTML file, and sorting arrays.
This issue of the JDC Tech Tips is written by Patrick Chan, the author
of the publication "The Java(TM) Developers Almanac".
EXTRACTING LINKS FROM AN HTML FILE
Many applications fetch an HTML page from the Web and then extract the
links from the page. For example, a link-checker application fetches a
page, extracts the links, and then checks the links to see if they refer
to actual pages.
The HTML 3.2 support in the Java(TM) 2 platform makes it fairly easy to
find and parse links. This tip demonstrates how to use that support.
The first step is to create an editor kit. The purpose of an editor kit
is to parse data in some format, such as HTML or RTF, and store the
information in a data structure that fully represents the data. This data
structure, called a Document, allows you to examine and modify the data in
a convenient way.
Let's look at an example. In the following example program, we're going
to examine the HTML data in a Document object. The program looks for A
(anchor) tags and extracts the HREF attribute information from these tags.
import java.io.*;
import java.net.*;
import javax.swing.text.*;
import javax.swing.text.html.*;
class GetLinks {
    public static void main(String[] args) {
        EditorKit kit = new HTMLEditorKit();
        Document doc = kit.createDefaultDocument();
        // The Document class does not yet handle charsets properly,
        // so tell it to ignore any charset directive in the page.
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        try {
            // Create a reader on the HTML content.
            Reader rd = getReader(args[0]);
            // Parse the HTML.
            kit.read(rd, doc, 0);
            // Iterate through the elements of the HTML document.
            ElementIterator it = new ElementIterator(doc);
            javax.swing.text.Element elem;
            while ((elem = it.next()) != null) {
                // Non-null only if the element is inside an A tag.
                SimpleAttributeSet s = (SimpleAttributeSet)
                    elem.getAttributes().getAttribute(HTML.Tag.A);
                if (s != null) {
                    System.out.println(
                        s.getAttribute(HTML.Attribute.HREF));
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
            // Report the failure with a nonzero exit status.
            System.exit(1);
        }
        // Exit explicitly with status 0 on success.
        System.exit(0);
    }

    // Returns a reader on the HTML data. If 'uri' begins
    // with "http:", it's treated as a URL; otherwise,
    // it's assumed to be a local filename.
    static Reader getReader(String uri) throws IOException {
        if (uri.startsWith("http:")) {
            // Retrieve from Internet.
            URLConnection conn = new URL(uri).openConnection();
            return new InputStreamReader(conn.getInputStream());
        } else {
            // Retrieve from file.
            return new FileReader(uri);
        }
    }
}
This program takes one parameter from the command line. If the
parameter starts with "http:", the program treats the parameter as a URL
and fetches the HTML from that URL. Otherwise, the parameter is treated as
a filename and the HTML is fetched from that file.
For example,
$ java GetLinks http://java.sun.com
retrieves the HTML from the main page at java.sun.com.
The editor kit is an HTMLEditorKit object that contains an HTML parser.
It creates a Document object that can represent HTML, and its read()
method parses the HTML and stores the information in the Document.
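As a minimal sketch of that parsing step on its own, the following reads
HTML from a string instead of a URL or file and returns the text held in
the resulting Document. The class name ParseSketch and the method
textOf() are hypothetical names chosen for illustration, and the
multi-catch syntax assumes a newer Java release than the one this tip
was written for.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.EditorKit;
import javax.swing.text.html.HTMLEditorKit;

// Hypothetical helper class, not part of the original tip.
class ParseSketch {
    // Parses the given HTML and returns the plain text stored
    // in the Document (markup removed).
    static String textOf(String html) {
        EditorKit kit = new HTMLEditorKit();
        Document doc = kit.createDefaultDocument();
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        try {
            // read() parses the HTML and fills in the Document.
            kit.read(new StringReader(html), doc, 0);
            return doc.getText(0, doc.getLength());
        } catch (IOException | BadLocationException e) {
            throw new IllegalStateException("could not parse HTML", e);
        }
    }

    public static void main(String[] args) {
        String html =
            "<html><body><p>Hello, <b>world</b></p></body></html>";
        System.out.println(textOf(html));
    }
}
```

The returned text may include line breaks that the parser adds around
block elements, so it is best compared with contains() rather than
equals().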
Once the HTML data is stored in the Document object, we're ready to look
for links. We create an iterator (an ElementIterator) that walks over the
elements of the document, which include the visible text pieces in the
HTML. For each text piece, we check whether it has been formatted for
linking, in other words, whether the text is formatted with the A
(anchor) tag. We do this by calling
getAttributes().getAttribute(HTML.Tag.A). If the text piece has been
formatted with the A tag, the method call returns the set of attributes
of the A tag used to format that text piece. Otherwise, the method call
simply returns null.
Note: The name getAttributes() is a little confusing because it has
nothing to do with HTML attributes; the "attributes" in this case are all
the HTML tags (such as an A tag) that were used to format that text
piece.
Now we have the set of attributes of the A tag used to format a piece of
text; it's stored in a SimpleAttributeSet object. So we just need to get
the value of the HREF attribute and we're done. We can do this by calling
getAttribute(HTML.Attribute.HREF) on the A tag's attribute set.
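The lookup described above can also be packaged as a method that returns
the links rather than printing them. The sketch below is one possible
variation, not the tip's own program: the names LinkListSketch and
extractLinks() are hypothetical, it parses HTML from a string, and it
assumes a newer Java release with generics. It also skips A tags that
have no HREF attribute (for example, named anchors), which the program
above would print as "null".

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.AttributeSet;
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.Element;
import javax.swing.text.ElementIterator;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;

// Hypothetical helper class, not part of the original tip.
class LinkListSketch {
    // Returns the HREF values of the A tags found in the HTML.
    static List<String> extractLinks(String html) {
        HTMLEditorKit kit = new HTMLEditorKit();
        Document doc = kit.createDefaultDocument();
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        try {
            kit.read(new StringReader(html), doc, 0);
        } catch (IOException | BadLocationException e) {
            throw new IllegalStateException("could not parse HTML", e);
        }
        List<String> links = new ArrayList<>();
        ElementIterator it = new ElementIterator(doc);
        Element elem;
        while ((elem = it.next()) != null) {
            // The A tag's attribute set, or null if this element
            // is not formatted with an A tag.
            Object a = elem.getAttributes().getAttribute(HTML.Tag.A);
            if (a instanceof AttributeSet) {
                Object href =
                    ((AttributeSet) a).getAttribute(HTML.Attribute.HREF);
                if (href != null) {
                    links.add(href.toString());
                }
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<html><body><p>"
            + "<a href=\"http://example.com/a\">first</a> and "
            + "<a name=\"top\">anchor</a> and "
            + "<a href=\"b.html\">second</a>."
            + "</p></body></html>";
        System.out.println(extractLinks(html));
    }
}
```

Links come back exactly as written in the page, so a relative link such
as b.html is not resolved against the page's address.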
SORTING ARRAYS
This tip discusses how you can sort data in arrays. Sorting arrays of
primitive types is easy. There are seven methods in the class Arrays for
sorting arrays of each of the seven primitive types: byte, char, double,
float, int, long, and short. Here's an example that sorts an array of
doubles.
import java.util.*;
class Sort1 {
    // Sorts an array of random double values.
    public static void main(String[] args) {
        double[] dblarr = new double[10];
        for (int i=0; i<dblarr.length; i++) {
            dblarr[i] = Math.random();
        }
        // Sort the array.
        Arrays.sort(dblarr);
        // Print the array.
        for (int i=0; i<dblarr.length; i++) {
            System.out.println(dblarr[i]);
        }
    }
}
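One detail worth keeping in mind is that Arrays.sort() sorts the array in
place. The following sketch, which uses the hypothetical names
PrimitiveSortSketch and sortedCopy(), shows one way to keep the original
order by sorting a copy, and also calls the char[] version of the method.

```java
import java.util.Arrays;

// Hypothetical helper class, not part of the original tip.
class PrimitiveSortSketch {
    // Returns a sorted copy and leaves the caller's array untouched.
    static int[] sortedCopy(int[] values) {
        int[] copy = values.clone();
        Arrays.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        int[] ids = {42, 7, 19};
        int[] sorted = sortedCopy(ids);
        System.out.println(Arrays.toString(ids));    // prints [42, 7, 19]
        System.out.println(Arrays.toString(sorted)); // prints [7, 19, 42]

        // The char[] version of sort() works the same way, in place.
        char[] letters = {'q', 'a', 'm'};
        Arrays.sort(letters);
        System.out.println(new String(letters));     // prints amq
    }
}
```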
Sorting an array of objects is just as easy if the objects implement the
Comparable interface, java.lang.Comparable. This interface gives a
natural ordering for a class so that objects of that class can be sorted.
Here's an example that sorts an array of String objects; the String class
implements Comparable.
import java.util.*;
class Sort2 {
    // Sorts the arguments in args.
    public static void main(String[] args) {
        Arrays.sort(args);
        // Print the arguments in args.
        for (int i=0; i<args.length; i++) {
            System.out.println(args[i]);
        }
    }
}
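The natural ordering of String compares character values, so uppercase
letters sort before lowercase ones. A small sketch makes this visible;
NaturalOrderSketch and sortedCopy() are hypothetical names used only for
illustration.

```java
import java.util.Arrays;

// Hypothetical helper class, not part of the original tip.
class NaturalOrderSketch {
    // Returns a copy of the words sorted by String's natural ordering.
    static String[] sortedCopy(String[] words) {
        String[] copy = words.clone();
        Arrays.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        String[] words = {"banana", "Cherry", "apple"};
        // 'C' has a smaller character value than 'a', so "Cherry"
        // comes first.
        System.out.println(Arrays.toString(sortedCopy(words)));
        // prints [Cherry, apple, banana]
    }
}
```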
What if the objects do not implement Comparable? Well, you've got
two choices: you can modify the objects to implement Comparable, or you
can supply a Comparator to the sort method. Let's look at the first option
first.
To make an object comparable, add Comparable to the implements list of
the object's class and implement the compareTo() method. The compareTo()
method compares the object with another object of the same type. If the
object should appear before the other object, compareTo() should return a
negative number. If the object should appear after the other object,
compareTo() should return a positive number. If the two objects are
equal, compareTo() should return zero.
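The three possible results can be seen in a small sketch. The Version
class below is hypothetical and exists only to illustrate the rules
above; it also assumes a newer Java release, where Comparable takes a
type parameter and Integer.compare() is available.

```java
import java.util.Arrays;

// Hypothetical class: a version number ordered by major, then minor.
class Version implements Comparable<Version> {
    final int major;
    final int minor;

    Version(int major, int minor) {
        this.major = major;
        this.minor = minor;
    }

    // Negative if this version comes first, positive if it comes
    // after the other version, zero if the two are equal.
    public int compareTo(Version other) {
        if (major != other.major) {
            return Integer.compare(major, other.major);
        }
        return Integer.compare(minor, other.minor);
    }

    public String toString() {
        return major + "." + minor;
    }
}

class VersionSortSketch {
    public static void main(String[] args) {
        Version[] versions = {
            new Version(2, 0), new Version(1, 10), new Version(1, 2)
        };
        Arrays.sort(versions);
        System.out.println(Arrays.toString(versions));
        // prints [1.2, 1.10, 2.0]
    }
}
```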
Point is an AWT class that is not comparable. The following example
creates a version of Point that is comparable. It sorts points by distance
from the origin.
import java.util.*;
import java.awt.*;
class MyPoint extends java.awt.Point implements Comparable {
    MyPoint(int x, int y) {
        super(x, y);
    }

    // Orders points by their distance from the origin.
    public int compareTo(Object o) {
        MyPoint p = (MyPoint)o;
        double d1 = Math.sqrt(x*x + y*y);
        double d2 = Math.sqrt(p.x*p.x + p.y*p.y);
        if (d1 < d2) {
            return -1;
        } else if (d2 < d1) {
            return 1;
        }
        return 0;
    }
}
class Sort3 {
    public static void main(String[] args) {
        Random rnd = new Random();
        MyPoint[] points = new MyPoint[10];
        for (int i=0; i<points.length; i++) {
            points[i] = new MyPoint(rnd.nextInt(100),
                                    rnd.nextInt(100));
        }
        Arrays.sort(points);
        // Print the points.
        for (int i=0; i<points.length; i++) {
            System.out.println(points[i]);
        }
    }
}
If you can't or don't want to make an object Comparable, you can supply a
Comparator object to the Arrays.sort() method. The Comparator object must
implement a method called compare(). The behavior of the compare() method
is almost identical to that of the compareTo() method of the Comparable
interface, except that compare() takes both objects to be compared as
parameters.
The next example is similar to the one above. However, instead of
creating a special kind of Point, we create a comparator that can sort
Point objects.
import java.util.*;
import java.awt.*;
class PointComparator implements Comparator {
    // Orders points by their distance from the origin.
    public int compare(Object o1, Object o2) {
        Point p1 = (Point)o1;
        Point p2 = (Point)o2;
        double d1 = Math.sqrt(p1.x*p1.x + p1.y*p1.y);
        double d2 = Math.sqrt(p2.x*p2.x + p2.y*p2.y);
        if (d1 < d2) {
            return -1;
        } else if (d2 < d1) {
            return 1;
        }
        return 0;
    }
}
class Sort4 {
    public static void main(String[] args) {
        Random rnd = new Random();
        Point[] points = new Point[10];
        for (int i=0; i<points.length; i++) {
            points[i] = new Point(rnd.nextInt(100),
                                  rnd.nextInt(100));
        }
        Arrays.sort(points, new PointComparator());
        // Print the points.
        for (int i=0; i<points.length; i++) {
            System.out.println(points[i]);
        }
    }
}
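On newer Java releases the same comparator can be written with a type
parameter, which removes the casts. The sketch below is a variation on
the example above rather than a replacement for it: DistanceComparator
and Sort4Generic are hypothetical names, and the comparator compares
squared distances, which gives the same order without calling
Math.sqrt().

```java
import java.awt.Point;
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical variation on PointComparator using generics.
class DistanceComparator implements Comparator<Point> {
    // Orders points by squared distance from the origin. Comparing
    // squares gives the same order as comparing the distances.
    public int compare(Point p1, Point p2) {
        long d1 = (long) p1.x * p1.x + (long) p1.y * p1.y;
        long d2 = (long) p2.x * p2.x + (long) p2.y * p2.y;
        return Long.compare(d1, d2);
    }
}

class Sort4Generic {
    public static void main(String[] args) {
        Point[] points = {
            new Point(3, 4), new Point(1, 1), new Point(0, 2)
        };
        Arrays.sort(points, new DistanceComparator());
        // Print the points, nearest to the origin first.
        for (Point p : points) {
            System.out.println(p);
        }
    }
}
```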