Isn't gzip infamous for being slow? HBase is using LZO for this reason.
Apache Thrift is also using something else.
I can recommend using UIMA-AS and letting it process data stored in HBase.
We run a couple of UIMA-AS services on the same machines that host Hadoop
and got good results. Anyway, since we are using OpenNLP, the bottleneck we
hit is CPU power.
The first optimization I would do is to get data locality with UIMA-AS;
then the network bottleneck vanishes.
What kind of analysis do you run? Is it also CPU intensive?
Jörn
On 7/14/11 10:32 PM, Greg Holmberg (JIRA) wrote:
[
https://issues.apache.org/jira/browse/UIMA-2128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13065516#comment-13065516
]
Greg Holmberg commented on UIMA-2128:
-------------------------------------
I'm not sure where in the code it should be implemented (which seems to be
Jörn's point), but another technology option is EXI, a binary encoding for XML.
See http://www.w3.org/XML/EXI
I've experimented with this to encode XMI between a UIMA process and a database
on a separate machine (both ends Java). I used a commercial implementation,
Efficient XML from AgileDelta, and did some throughput measurements, comparing
it to gzipped XMI XML. I found that EXI produces somewhat smaller data than
gzipped XML text (maybe 10% or 20% smaller, if I remember correctly). The
biggest benefit to EXI was the amount of CPU time required to read and write.
It was quite a bit faster than gzip to generate the XML and parse the XML.
This is probably because it does so directly from the ContentHandler, whereas
with gzipped text, you first have to write the text and then compress it.
Also, I suppose it's just more efficient to parse a binary format than to step
through characters looking for certain tokens.
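The gzip side of such a comparison can be sketched with the JDK alone. This is only a rough harness, not the measurement described above: the class name GzipCost and the generated sample XML are made up for illustration (it is not real XMI output), and a serious benchmark would need warm-up runs and real documents.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper for timing gzip on XML text; not part of UIMA.
final class GzipCost {
    // Compresses a byte array in memory with the JDK's gzip stream.
    static byte[] gzip(byte[] plain) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buffer)) {
            gz.write(plain);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The gzip trailer is written when the stream is closed above.
        return buffer.toByteArray();
    }

    // Decompresses a gzip byte array back to the original bytes.
    static byte[] gunzip(byte[] compressed) {
        try (GZIPInputStream gz =
                 new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            return gz.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Builds a made-up, repetitive XML document standing in for XMI.
    static String sampleXml(int annotations) {
        StringBuilder sb = new StringBuilder("<doc>");
        for (int i = 0; i < annotations; i++) {
            sb.append("<Token begin=\"").append(i * 5)
              .append("\" end=\"").append(i * 5 + 4).append("\"/>");
        }
        return sb.append("</doc>").toString();
    }

    public static void main(String[] args) {
        byte[] plain = sampleXml(10_000).getBytes(StandardCharsets.UTF_8);

        long start = System.nanoTime();
        byte[] packed = gzip(plain);
        long gzipMicros = (System.nanoTime() - start) / 1000;

        start = System.nanoTime();
        byte[] restored = gunzip(packed);
        long gunzipMicros = (System.nanoTime() - start) / 1000;

        System.out.printf("plain=%d bytes, gzip=%d bytes, restored=%d bytes%n",
                plain.length, packed.length, restored.length);
        System.out.printf("gzip=%d us, gunzip=%d us%n", gzipMicros, gunzipMicros);
    }
}
```

Running the same timing around an EXI encoder and decoder would give the other half of the comparison; no EXI library is used here because its API is not something this sketch can assume.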
In my case, the improved throughput and reduction of CPU usage were most
important on the receiving end (i.e. the database) since it is a central
bottleneck in the overall landscape of my system. As the number of UIMA
senders increases, the database reaches its limits to handle more messages (no
more CPU or NIC capacity on that machine). So it was important to me to be as
efficient as possible with the XMI parsing on that machine in order to minimize
my hardware costs.
In my case, all annotators are local, but a similar bottleneck situation could
arise in UIMA AS if you have a remote annotator (service). Then, making that
UIMA processor as efficient as possible in terms of both CPU and network
bandwidth usage becomes important. Gzip will help a lot compared to plain
text, but EXI is even better, especially for reducing the CPU usage of XML
generation and parsing, but also somewhat for network bandwidth.
Some open-source implementations of EXI are listed here:
http://en.wikipedia.org/wiki/Efficient_XML_Interchange
I also tried the open-source Java implementation, EXIficient. In early 2010 it
implemented the standard technically correctly, but was immature, slow (really,
really slow!), and used a lot of memory. However, it's been a year, so
maybe it has improved since then. I talked to the developers (from Siemens)
about their use of the GPL license, and they were not interested in changing to
an Apache-compatible license, so that may be an issue for use in UIMA.
I have not tried the other open-source Java implementation, OpenEXI. This uses
the Apache license though. There's some discussion of gzipped XML text versus
EXI here: http://openexi.sourceforge.net/#whynotgzip
There's also an open-source C implementation, called EXIP. I don't know
anything about it.
Support for gzipped XMI files
--------------------------------
Key: UIMA-2128
URL: https://issues.apache.org/jira/browse/UIMA-2128
Project: UIMA
Issue Type: Wish
Components: CasEditor
Reporter: Richard Eckart de Castilho
Since XMI files tend to grow rather rapidly, it would be great if the CAS
Editor supported reading and writing gzipped XMI files (.xmi.gz).
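A minimal sketch of what such support could look like on the I/O side, using only the JDK: pick a gzip or plain stream based on the file name. The class XmiGzipIo and its methods are hypothetical and not part of UIMA or the CAS Editor. In a real integration the returned streams would presumably be handed to the UIMA XMI serializer and deserializer (as far as I recall, XmiCasSerializer.serialize(cas, out) and XmiCasDeserializer.deserialize(in, cas)); please check that against the UIMA API docs.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper; the ".gz" suffix check is an assumed convention.
final class XmiGzipIo {
    private XmiGzipIo() {}

    // True for names such as "doc.xmi.gz".
    static boolean isGzipped(Path file) {
        return file.getFileName().toString().toLowerCase().endsWith(".gz");
    }

    // Opens the file for reading, decompressing transparently if needed.
    static InputStream openInput(Path file) throws IOException {
        InputStream raw = Files.newInputStream(file);
        return isGzipped(file) ? new GZIPInputStream(raw) : raw;
    }

    // Opens the file for writing, compressing transparently if needed.
    static OutputStream openOutput(Path file) throws IOException {
        OutputStream raw = Files.newOutputStream(file);
        return isGzipped(file) ? new GZIPOutputStream(raw) : raw;
    }

    // Writes XML text; stands in for serializing a CAS to the stream.
    static void writeText(Path file, String xml) {
        try (OutputStream out = openOutput(file)) {
            out.write(xml.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads XML text; stands in for deserializing a CAS from the stream.
    static String readText(Path file) {
        try (InputStream in = openInput(file)) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Deciding by file extension keeps plain .xmi files working unchanged; sniffing the gzip magic bytes instead would be an alternative if file names cannot be trusted.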