Author: reto
Date: Sun Oct 7 14:13:12 2012
New Revision: 1395309
URL: http://svn.apache.org/viewvc?rev=1395309&view=rev
Log:
STANBOL-762: added engine that extracts XMP metadata, it currently doesn't
create relations between the content item and the resources from the XMP graph
Added:
stanbol/trunk/enhancer/engines/xmpextractor/
- copied from r1395000, stanbol/trunk/enhancer/engines/htmlextractor/
stanbol/trunk/enhancer/engines/xmpextractor/src/main/java/org/apache/stanbol/enhancer/engines/xmpextractor/
stanbol/trunk/enhancer/engines/xmpextractor/src/main/java/org/apache/stanbol/enhancer/engines/xmpextractor/XmpExtractorEngine.java
- copied, changed from r1395000,
stanbol/trunk/enhancer/engines/htmlextractor/src/main/java/org/apache/stanbol/enhancer/engines/htmlextractor/HtmlExtractorEngine.java
Removed:
stanbol/trunk/enhancer/engines/xmpextractor/src/main/java/org/apache/stanbol/enhancer/engines/htmlextractor/
stanbol/trunk/enhancer/engines/xmpextractor/src/test/java/org/apache/stanbol/enhancer/engines/htmlextractor/
Modified:
stanbol/trunk/enhancer/bundlelist/src/main/bundles/list.xml
stanbol/trunk/enhancer/engines/pom.xml
stanbol/trunk/enhancer/engines/xmpextractor/README.md
stanbol/trunk/enhancer/engines/xmpextractor/pom.xml
Modified: stanbol/trunk/enhancer/bundlelist/src/main/bundles/list.xml
URL:
http://svn.apache.org/viewvc/stanbol/trunk/enhancer/bundlelist/src/main/bundles/list.xml?rev=1395309&r1=1395308&r2=1395309&view=diff
==============================================================================
--- stanbol/trunk/enhancer/bundlelist/src/main/bundles/list.xml (original)
+++ stanbol/trunk/enhancer/bundlelist/src/main/bundles/list.xml Sun Oct 7 14:13:12 2012
@@ -173,6 +173,13 @@
<artifactId>org.apache.stanbol.enhancer.engines.refactor</artifactId>
<version>0.10.0-SNAPSHOT</version>
</bundle>
+
+ <!-- XMP Extractor engine -->
+ <bundle>
+ <groupId>org.apache.stanbol</groupId>
+ <artifactId>org.apache.stanbol.enhancer.engines.xmpextractor</artifactId>
+ <version>0.10.0-SNAPSHOT</version>
+ </bundle>
<!-- External Service Integration -->
Modified: stanbol/trunk/enhancer/engines/pom.xml
URL:
http://svn.apache.org/viewvc/stanbol/trunk/enhancer/engines/pom.xml?rev=1395309&r1=1395308&r2=1395309&view=diff
==============================================================================
--- stanbol/trunk/enhancer/engines/pom.xml (original)
+++ stanbol/trunk/enhancer/engines/pom.xml Sun Oct 7 14:13:12 2012
@@ -48,6 +48,7 @@
<module>topic</module>
<module>metaxa</module>
<module>htmlextractor</module>
+ <module>xmpextractor</module>
<module>tika</module>
<module>entitytagging</module>
<module>keywordextraction</module>
Modified: stanbol/trunk/enhancer/engines/xmpextractor/README.md
URL:
http://svn.apache.org/viewvc/stanbol/trunk/enhancer/engines/xmpextractor/README.md?rev=1395309&r1=1395000&r2=1395309&view=diff
==============================================================================
--- stanbol/trunk/enhancer/engines/xmpextractor/README.md (original)
+++ stanbol/trunk/enhancer/engines/xmpextractor/README.md Sun Oct 7 14:13:12 2012
@@ -13,145 +13,5 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY
See the License for the specific language governing permissions and
limitations under the License.
-# Htmlextractor: Metadata extraction from HTML documents
-
-The **Htmlextractor Enhancement Engine** extracts embedded metadata from HTML documents, such as [Microformats](http://microformats.org/) and [RDFa](http://www.w3.org/TR/rdfa-syntax/).
-By providing other extractors it can be configured for any kind of content extraction from HTML pages.
-
-##Technical description
-
-### Supported metadata types
-
-The built-in extractors are defined in the default resource *htmlextractors.xml*. The following metadata types are supported:
-* RDFa
-* geo
-* hAtom
-* hCal
-* hCard
-* hReview
-* rel-license
-* rel-tag
-* xFolk
-
-### Vocabularies
-
-#### HTML Microformat Extractors
-
-The following table describes which vocabularies are used for representing microformat data in Metaxa:
-
-
-<table border="1">
- <tr>
- <th>MF</th>
- <th>Vocabulary (Namespace)</th>
- </tr>
- <tr>
- <td>geo</td>
- <td>wgs84 (<tt>http://www.w3.org/2003/01/geo/wgs84_pos#</tt>)</td>
- </tr>
- <tr>
- <td>hAtom</td>
- <td>atom (<tt>http://www.w3.org/2005/Atom#)</td>
- </tr>
- <tr>
- <td/>
- <td>tagging (<tt>http://aperture.sourceforge.net/ontologies/tagging#</tt>)</td>
- </tr>
- <tr>
- <td>hCal</td>
- <td> ical (<tt>http://www.w3.org/2002/12/cal/icaltzd#</tt>)</td>
- </tr>
- <tr>
- <td></td>
- <td>vcard (<tt>http://www.w3.org/2006/vcard/ns#</tt>)</td>
- </tr>
- <tr>
- <td>hCard</td>
- <td>vcard (<tt>http://www.w3.org/2006/vcard/ns#</tt>)</td>
- </tr>
- <tr>
- <td>hReview</td>
- <td>review (<tt>http://www.purl.org/stuff/rev#</tt>)</td></tr>
- <tr>
- <td></td>
- <td>wgs84 (<tt>http://www.w3.org/2003/01/geo/wgs84_pos#</tt>)</td>
- </tr>
- <tr>
- <td></td>
- <td>dc (<tt>http://purl.org/dc/elements/1.1/</tt>)</td>
- </tr>
- <tr>
- <td></td>
- <td>dcterms (<tt>http://purl.org/dc/dcmitype/</tt>)</td>
- </tr>
- <tr>
- <td></td>
- <td>foaf (<tt>http://xmlns.com/foaf/0.1/</tt>)</td>
- </tr>
- <tr>
- <td></td>
- <td>vcard (<tt>http://www.w3.org/2006/vcard/ns#</tt>)</td>
- </tr>
- <tr>
- <td></td>
- <td>tag (<tt>http://www.holygoat.co.uk/owl/redwood/0.1/tags/</tt>)</td>
- </tr>
- <tr>
- <td>rel-license</td>
- <td>dc (<tt>http://purl.org/dc/elements/1.1</tt>/)</td>
- </tr>
- <tr>
- <td>rel-tag</td>
- <td> tagging (<tt>http://aperture.sourceforge.net/ontologies/tagging#</tt>)</td>
- </tr>
- <tr>
- <td>xFolk</td>
- <td>nfo (<tt>http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#</tt>)</td>
- </tr>
- <tr>
- <td></td>
- <td>dc (<tt>http://purl.org/dc/elements/1.1</tt>/)</td>
- </tr>
- <tr>
- <td></td>
- <td>tagging (<tt>http://aperture.sourceforge.net/ontologies/tagging#</tt>)</td>
- </tr>
-</table>
-
-To prevent the occurrence of unconnected graphs in the metadata extracted subgraphs get connected to the content item by the property:
-
-* http://www.semanticdesktop.org/ontologies/2007/01/19/nie#contains
-
-## Configuration options
-
-By default, the Htmlextractor engine uses the extractors specified in the resource "htmlregistry.xml".
-Alternative configurations and extractors can be attached to the Htmlextractor as fragment bundles, specifying as host bundle
-
- Fragment-Host: org.apache.stanbol.enhancer.engines.htmlextractor
-
-The alternative configuration files then can be set as values of the property
-
-* <pre><code>org.apache.stanbol.enhancer.engines.htmlextractor.htmlextractors</pre></code>
-
-
-## Usage
-
-Assuming that the Stanbol endpoint with the full launcher is running at
-
- http://localhost:8080
-
-and the engine is activated, from the command line commands like this can be used for submitting some file as content item, where the mime type must match the document type:
-
-* stateless interface
-
- curl -i -X POST -H "Content-Type:text/html" -T testpage.html http://localhost:8080/enhancer
-
-* stateful interface
-
- curl -i -X PUT -H "Content-Type:text/html" -T testpage.html http://localhost:8080/contenthub/content/someFileId
-
-Alternatively, the Stanbol web interface can be used for submitting documents
-and viewing the metadata at
-
- http://localhost:8080/contenthub
+# Xmpextractor: Extracts XMP metadata from various file formats containing XMP
Modified: stanbol/trunk/enhancer/engines/xmpextractor/pom.xml
URL:
http://svn.apache.org/viewvc/stanbol/trunk/enhancer/engines/xmpextractor/pom.xml?rev=1395309&r1=1395000&r2=1395309&view=diff
==============================================================================
--- stanbol/trunk/enhancer/engines/xmpextractor/pom.xml (original)
+++ stanbol/trunk/enhancer/engines/xmpextractor/pom.xml Sun Oct 7 14:13:12 2012
@@ -16,133 +16,103 @@
limitations under the License.
-->
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
- xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
+ xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
- <parent>
- <artifactId>org.apache.stanbol.enhancer.parent</artifactId>
- <groupId>org.apache.stanbol</groupId>
- <version>0.10.0-SNAPSHOT</version>
- <relativePath>../../parent</relativePath>
- </parent>
+ <parent>
+ <artifactId>org.apache.stanbol.enhancer.parent</artifactId>
+ <groupId>org.apache.stanbol</groupId>
+ <version>0.10.0-SNAPSHOT</version>
+ <relativePath>../../parent</relativePath>
+ </parent>
<groupId>org.apache.stanbol</groupId>
- <artifactId>org.apache.stanbol.enhancer.engines.htmlextractor</artifactId>
+ <artifactId>org.apache.stanbol.enhancer.engines.xmpextractor</artifactId>
<version>0.10.0-SNAPSHOT</version>
<packaging>bundle</packaging>
- <name>Apache Stanbol Enhancer Enhancement Engine : HTML Extractors</name>
- <description>Enhancement Engine that extracts RDFa and Microformat data from HTML pages
- </description>
-
- <inceptionYear>2012</inceptionYear>
-
- <scm>
- <connection>
- scm:svn:http://svn.apache.org/repos/asf/stanbol/trunk/enhancer/engines/html/
- </connection>
- <developerConnection>
- scm:svn:https://svn.apache.org/repos/asf/stanbol/trunk/enhancer/engines/html/
- </developerConnection>
- <url>http://stanbol.apache.org/</url>
- </scm>
- <build>
- <plugins>
- <plugin>
- <groupId>org.apache.felix</groupId>
- <artifactId>maven-bundle-plugin</artifactId>
- <extensions>true</extensions>
- <configuration>
- <instructions>
- <Export-Package>
- org.apache.stanbol.enhancer.engines.html;version=${project.version}
- </Export-Package>
- <Embed-Dependency>
- jtidy;scope=compile
- </Embed-Dependency>
- <Import-Package>
- !org.apache.tools.*,
- *
- </Import-Package>
- </instructions>
- </configuration>
- </plugin>
- <plugin>
- <groupId>org.apache.felix</groupId>
- <artifactId>maven-scr-plugin</artifactId>
- </plugin>
- <plugin>
- <groupId>org.apache.rat</groupId>
- <artifactId>apache-rat-plugin</artifactId>
- <configuration>
- <excludes>
- <!-- AL20 License -->
- <exclude>src/license/THIRD-PARTY.properties</exclude>
- <!-- AL20 License for test resources (see src/test/resources/README) -->
- <exclude>src/test/resources/*.html</exclude>
- </excludes>
- </configuration>
- </plugin>
- </plugins>
- </build>
-
- <properties>
- <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
- </properties>
+ <name>Apache Stanbol Enhancer Enhancement Engine : XMP Extractors</name>
+ <description>Enhancement Engine that extracts XMP data
+ </description>
+
+ <inceptionYear>2012</inceptionYear>
+
+ <scm>
+ <connection>
+ scm:svn:http://svn.apache.org/repos/asf/stanbol/trunk/enhancer/engines/xmp/
+ </connection>
+ <developerConnection>
+ scm:svn:https://svn.apache.org/repos/asf/stanbol/trunk/enhancer/engines/xmp/
+ </developerConnection>
+ <url>http://stanbol.apache.org/</url>
+ </scm>
+ <build>
+ <plugins>
+ <plugin>
+ <groupId>org.apache.felix</groupId>
+ <artifactId>maven-bundle-plugin</artifactId>
+ <extensions>true</extensions>
+ </plugin>
+ <plugin>
+ <groupId>org.apache.felix</groupId>
+ <artifactId>maven-scr-plugin</artifactId>
+ </plugin>
+ </plugins>
+ </build>
+
+ <properties>
+ <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
+ </properties>
<dependencies>
- <dependency>
- <groupId>org.apache.stanbol</groupId>
- <artifactId>org.apache.stanbol.enhancer.servicesapi</artifactId>
- <version>0.10.0-SNAPSHOT</version>
- <scope>provided</scope>
- </dependency>
- <dependency>
- <groupId>org.apache.felix</groupId>
- <artifactId>org.apache.felix.scr.annotations</artifactId>
- </dependency>
- <dependency>
- <groupId>org.apache.clerezza</groupId>
- <artifactId>rdf.core</artifactId>
- </dependency>
- <dependency>
- <groupId>org.apache.clerezza</groupId>
- <artifactId>rdf.jena.parser</artifactId>
- </dependency>
+ <dependency>
+ <groupId>org.apache.stanbol</groupId>
+ <artifactId>org.apache.stanbol.enhancer.servicesapi</artifactId>
+ <version>0.10.0-SNAPSHOT</version>
+ <scope>provided</scope>
+ </dependency>
+ <dependency>
+ <groupId>org.apache.felix</groupId>
+ <artifactId>org.apache.felix.scr.annotations</artifactId>
+ </dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
</dependency>
+ <dependency>
+ <groupId>org.apache.clerezza</groupId>
+ <artifactId>rdf.ontologies</artifactId>
+ </dependency>
+ <!-- <dependency>
+ <groupId>org.apache.pdfbox</groupId>
+ <artifactId>jempbox</artifactId>
+ <version>1.7.1</version>
+ </dependency> -->
<!-- Test dependencies -->
- <dependency>
- <groupId>org.apache.stanbol</groupId>
- <artifactId>org.apache.stanbol.enhancer.test</artifactId>
- <version>0.10.0-SNAPSHOT</version>
- <scope>test</scope>
- </dependency>
- <dependency>
- <groupId>org.apache.stanbol</groupId>
- <artifactId>org.apache.stanbol.enhancer.core</artifactId>
- <version>0.10.0-SNAPSHOT</version>
- <scope>test</scope>
- </dependency>
- <dependency>
- <groupId>org.slf4j</groupId>
- <artifactId>slf4j-simple</artifactId>
- <scope>test</scope>
- </dependency>
- <dependency>
- <groupId>junit</groupId>
- <artifactId>junit</artifactId>
- <scope>test</scope>
- </dependency>
-
- <dependency>
- <groupId>com.ibm.icu</groupId>
- <artifactId>icu4j</artifactId>
- </dependency>
- <dependency>
- <groupId>net.sf.jtidy</groupId>
- <artifactId>jtidy</artifactId>
- <version>r938</version>
- </dependency>
+ <dependency>
+ <groupId>org.apache.stanbol</groupId>
+ <artifactId>org.apache.stanbol.enhancer.test</artifactId>
+ <version>0.10.0-SNAPSHOT</version>
+ <scope>test</scope>
+ </dependency>
+ <dependency>
+ <groupId>org.apache.stanbol</groupId>
+ <artifactId>org.apache.stanbol.enhancer.core</artifactId>
+ <version>0.10.0-SNAPSHOT</version>
+ <scope>test</scope>
+ </dependency>
+ <dependency>
+ <groupId>org.slf4j</groupId>
+ <artifactId>slf4j-simple</artifactId>
+ <scope>test</scope>
+ </dependency>
+ <dependency>
+ <groupId>junit</groupId>
+ <artifactId>junit</artifactId>
+ <scope>test</scope>
+ </dependency>
+ <dependency>
+ <groupId>org.apache.tika</groupId>
+ <artifactId>tika-parsers</artifactId>
+ <type>jar</type>
+ </dependency>
</dependencies>
</project>
Copied:
stanbol/trunk/enhancer/engines/xmpextractor/src/main/java/org/apache/stanbol/enhancer/engines/xmpextractor/XmpExtractorEngine.java
(from r1395000,
stanbol/trunk/enhancer/engines/htmlextractor/src/main/java/org/apache/stanbol/enhancer/engines/htmlextractor/HtmlExtractorEngine.java)
URL:
http://svn.apache.org/viewvc/stanbol/trunk/enhancer/engines/xmpextractor/src/main/java/org/apache/stanbol/enhancer/engines/xmpextractor/XmpExtractorEngine.java?p2=stanbol/trunk/enhancer/engines/xmpextractor/src/main/java/org/apache/stanbol/enhancer/engines/xmpextractor/XmpExtractorEngine.java&p1=stanbol/trunk/enhancer/engines/htmlextractor/src/main/java/org/apache/stanbol/enhancer/engines/htmlextractor/HtmlExtractorEngine.java&r1=1395000&r2=1395309&rev=1395309&view=diff
==============================================================================
--- stanbol/trunk/enhancer/engines/htmlextractor/src/main/java/org/apache/stanbol/enhancer/engines/htmlextractor/HtmlExtractorEngine.java (original)
+++ stanbol/trunk/enhancer/engines/xmpextractor/src/main/java/org/apache/stanbol/enhancer/engines/xmpextractor/XmpExtractorEngine.java Sun Oct 7 14:13:12 2012
@@ -14,125 +14,50 @@
* See the License for the specific language governing permissions and
* limitations under the License.
*/
-package org.apache.stanbol.enhancer.engines.htmlextractor;
+package org.apache.stanbol.enhancer.engines.xmpextractor;
+import java.io.ByteArrayInputStream;
+import java.io.ByteArrayOutputStream;
import java.io.IOException;
-import java.nio.charset.Charset;
-import java.util.Arrays;
+import java.io.InputStream;
import java.util.Collections;
-import java.util.Dictionary;
-import java.util.HashSet;
import java.util.Map;
-import java.util.Set;
-import org.apache.clerezza.rdf.core.MGraph;
-import org.apache.clerezza.rdf.core.UriRef;
-import org.apache.clerezza.rdf.core.impl.SimpleMGraph;
+import org.apache.clerezza.rdf.core.Graph;
+import org.apache.clerezza.rdf.core.serializedform.Parser;
import org.apache.felix.scr.annotations.Component;
import org.apache.felix.scr.annotations.Property;
import org.apache.felix.scr.annotations.Reference;
import org.apache.felix.scr.annotations.Service;
-import org.apache.stanbol.enhancer.engines.htmlextractor.impl.BundleURIResolver;
-import org.apache.stanbol.enhancer.engines.htmlextractor.impl.ClerezzaRDFUtils;
-import org.apache.stanbol.enhancer.engines.htmlextractor.impl.ExtractorException;
-import org.apache.stanbol.enhancer.engines.htmlextractor.impl.HtmlExtractionRegistry;
-import org.apache.stanbol.enhancer.engines.htmlextractor.impl.HtmlExtractor;
-import org.apache.stanbol.enhancer.engines.htmlextractor.impl.HtmlParser;
-import org.apache.stanbol.enhancer.engines.htmlextractor.impl.InitializationException;
-import org.apache.stanbol.enhancer.servicesapi.Blob;
import org.apache.stanbol.enhancer.servicesapi.ContentItem;
-import org.apache.stanbol.enhancer.servicesapi.ContentItemFactory;
import org.apache.stanbol.enhancer.servicesapi.EngineException;
import org.apache.stanbol.enhancer.servicesapi.EnhancementEngine;
import org.apache.stanbol.enhancer.servicesapi.ServiceProperties;
import org.apache.stanbol.enhancer.servicesapi.impl.AbstractEnhancementEngine;
-import org.apache.stanbol.enhancer.servicesapi.rdf.NamespaceEnum;
-import org.osgi.framework.BundleContext;
-import org.osgi.service.cm.ConfigurationException;
-import org.osgi.service.component.ComponentContext;
+import org.apache.tika.parser.image.xmp.XMPPacketScanner;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
-/**
- *
- * @author <a href="mailto:[email protected]">Walter Kasper</a>
- *
- */
+
@Component(immediate = true, metatype = true, inherit = true)
@Service
@org.apache.felix.scr.annotations.Properties(value={
- @Property(name=EnhancementEngine.PROPERTY_NAME, value="htmlextractor")
+ @Property(name=EnhancementEngine.PROPERTY_NAME, value="xmpextractor")
})
-public class HtmlExtractorEngine extends AbstractEnhancementEngine<IOException,RuntimeException>
+public class XmpExtractorEngine extends AbstractEnhancementEngine<IOException,RuntimeException>
implements EnhancementEngine, ServiceProperties {
- private static final Logger LOG = LoggerFactory.getLogger(HtmlExtractorEngine.class);
+ private static final Logger LOG = LoggerFactory.getLogger(XmpExtractorEngine.class);
- /**
- * The default charset
- */
- private static final Charset UTF8 = Charset.forName("UTF-8");
-
+
+ @Reference
+ Parser parser;
+
/**
* The default value for the Execution of this Engine. Currently set to
* {@link ServiceProperties#ORDERING_PRE_PROCESSING}
*/
public static final Integer defaultOrder = ORDERING_PRE_PROCESSING;
- private static final String DEFAULT_HTML_EXTRACTOR_REGISTRY = "htmlextractors.xml";
-
- /**
- * name of a file that defines the set of extractors for HTML documents. By default, the builtin file 'htmlextractors.xml' is used."
- */
- @Property(value=HtmlExtractorEngine.DEFAULT_HTML_EXTRACTOR_REGISTRY)
- public static final String HTML_EXTRACTOR_REGISTRY = "org.apache.stanbol.enhancer.engines.htmlextractor.htmlextractors";
-
- /**
- * Internally used to create additional {@link Blob} for transformed
- * versions af the original content
- */
- @Reference
- private ContentItemFactory ciFactory;
-
- BundleContext bundleContext;
-
- private Set<String> supportedMimeTypes = new HashSet<String>(Arrays.asList(new String[]{"text/html","application/xhtml+xml"}));
-
- private HtmlExtractionRegistry htmlExtractorRegistry;
- private HtmlParser htmlParser;
-
- private boolean singleRootRdf = true;
-
- protected void activate(ComponentContext ce) throws ConfigurationException, IOException {
- super.activate(ce);
- this.bundleContext = ce.getBundleContext();
- BundleURIResolver.BUNDLE = this.bundleContext.getBundle();
- String htmlExtractors = DEFAULT_HTML_EXTRACTOR_REGISTRY;
- Dictionary<String, Object> properties = ce.getProperties();
- String confFile = (String)properties.get(HTML_EXTRACTOR_REGISTRY);
- if (confFile != null && confFile.trim().length() > 0) {
- htmlExtractors = confFile;
- }
- try {
- this.htmlExtractorRegistry = new HtmlExtractionRegistry(htmlExtractors);
- }
- catch (InitializationException e) {
- LOG.error("Registry Initialization Error: " + e.getMessage());
- throw new IOException(e.getMessage());
- }
- this.htmlParser = new HtmlParser();
-
- }
-
- /**
- * The deactivate method.
- *
- * @param ce the {@link ComponentContext}
- */
- protected void deactivate(ComponentContext ce) {
- super.deactivate(ce);
- this.htmlParser = null;
- this.htmlExtractorRegistry = null;
- }
@Override
public Map<String,Object> getServiceProperties() {
@@ -150,35 +75,33 @@ public class HtmlExtractorEngine extends
@Override
public void computeEnhancements(ContentItem ci) throws EngineException {
- HtmlExtractor extractor = new HtmlExtractor(htmlExtractorRegistry, htmlParser);
- MGraph model = new SimpleMGraph();
- ci.getLock().readLock().lock();
- try {
- extractor.extract(ci.getUri().getUnicodeString(), ci.getStream(),null, ci.getMimeType(), model);
- } catch (ExtractorException e) {
- throw new EngineException("Error while processing ContentItem "
- + ci.getUri()+" with HtmlExtractor",e);
- } finally {
- ci.getLock().readLock().unlock();
- }
- ClerezzaRDFUtils.urifyBlankNodes(model);
- // make the model single rooted
- if (singleRootRdf) {
- ClerezzaRDFUtils.makeConnected(model,ci.getUri(),new UriRef(NamespaceEnum.nie+"contains"));
- }
- //add the extracted triples to the metadata of the ContentItem
- ci.getLock().writeLock().lock();
- try {
- LOG.info("Model: {}",model);
- ci.getMetadata().addAll(model);
- model = null;
- } finally {
- ci.getLock().writeLock().unlock();
- }
+ InputStream in = ci.getBlob().getStream();
+ XMPPacketScanner scanner = new XMPPacketScanner();
+ ByteArrayOutputStream baos = new ByteArrayOutputStream();
+ try {
+ scanner.parse(in, baos);
+ } catch (IOException e) {
+ throw new EngineException(e);
+ }
+ byte[] bytes = baos.toByteArray();
+ if (bytes.length > 0) {
+ Graph model = parser.parse(new ByteArrayInputStream(bytes), "application/rdf+xml");
+ ci.getLock().writeLock().lock();
+ try {
+ LOG.info("Model: {}",model);
+ ci.getMetadata().addAll(model);
+ } finally {
+ ci.getLock().writeLock().unlock();
+ }
+ }
}
private boolean isSupported(String mimeType) {
- return this.supportedMimeTypes.contains(mimeType);
+ if (mimeType.startsWith("text/")) {
+ return false; //assuming text types cannot contain XMP
+ } else {
+ return true; // As there isn't a list of media types that can contain XMP
+ }
}