reto Sun, 07 Oct 2012 07:14:05 -0700

Author: reto
Date: Sun Oct  7 14:13:12 2012
New Revision: 1395309

URL: http://svn.apache.org/viewvc?rev=1395309&view=rev
Log:
STANBOL-762: added an engine that extracts XMP metadata. It currently doesn't
create relations between the content item and the resources from the XMP graph.


Added:
    stanbol/trunk/enhancer/engines/xmpextractor/
      - copied from r1395000, stanbol/trunk/enhancer/engines/htmlextractor/
    stanbol/trunk/enhancer/engines/xmpextractor/src/main/java/org/apache/stanbol/enhancer/engines/xmpextractor/
    stanbol/trunk/enhancer/engines/xmpextractor/src/main/java/org/apache/stanbol/enhancer/engines/xmpextractor/XmpExtractorEngine.java
      - copied, changed from r1395000, stanbol/trunk/enhancer/engines/htmlextractor/src/main/java/org/apache/stanbol/enhancer/engines/htmlextractor/HtmlExtractorEngine.java
Removed:
    stanbol/trunk/enhancer/engines/xmpextractor/src/main/java/org/apache/stanbol/enhancer/engines/htmlextractor/
    stanbol/trunk/enhancer/engines/xmpextractor/src/test/java/org/apache/stanbol/enhancer/engines/htmlextractor/
Modified:
    stanbol/trunk/enhancer/bundlelist/src/main/bundles/list.xml
    stanbol/trunk/enhancer/engines/pom.xml
    stanbol/trunk/enhancer/engines/xmpextractor/README.md
    stanbol/trunk/enhancer/engines/xmpextractor/pom.xml

Modified: stanbol/trunk/enhancer/bundlelist/src/main/bundles/list.xml
URL: http://svn.apache.org/viewvc/stanbol/trunk/enhancer/bundlelist/src/main/bundles/list.xml?rev=1395309&r1=1395308&r2=1395309&view=diff
==============================================================================
--- stanbol/trunk/enhancer/bundlelist/src/main/bundles/list.xml (original)
+++ stanbol/trunk/enhancer/bundlelist/src/main/bundles/list.xml Sun Oct  7 14:13:12 2012
@@ -173,6 +173,13 @@
       <artifactId>org.apache.stanbol.enhancer.engines.refactor</artifactId>
       <version>0.10.0-SNAPSHOT</version>
     </bundle>
+       
+       <!-- XMP Extractor engine -->
+       <bundle>
+               <groupId>org.apache.stanbol</groupId>
+               <artifactId>org.apache.stanbol.enhancer.engines.xmpextractor</artifactId>
+               <version>0.10.0-SNAPSHOT</version>
+       </bundle>
 
     <!-- External Service Integration -->
 

Modified: stanbol/trunk/enhancer/engines/pom.xml
URL: http://svn.apache.org/viewvc/stanbol/trunk/enhancer/engines/pom.xml?rev=1395309&r1=1395308&r2=1395309&view=diff
==============================================================================
--- stanbol/trunk/enhancer/engines/pom.xml (original)
+++ stanbol/trunk/enhancer/engines/pom.xml Sun Oct  7 14:13:12 2012
@@ -48,6 +48,7 @@
     <module>topic</module>
     <module>metaxa</module>
        <module>htmlextractor</module>
+       <module>xmpextractor</module>
     <module>tika</module>
     <module>entitytagging</module>
     <module>keywordextraction</module>

Modified: stanbol/trunk/enhancer/engines/xmpextractor/README.md
URL: http://svn.apache.org/viewvc/stanbol/trunk/enhancer/engines/xmpextractor/README.md?rev=1395309&r1=1395000&r2=1395309&view=diff
==============================================================================
--- stanbol/trunk/enhancer/engines/xmpextractor/README.md (original)
+++ stanbol/trunk/enhancer/engines/xmpextractor/README.md Sun Oct  7 14:13:12 2012
@@ -13,145 +13,5 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY 
 See the License for the specific language governing permissions and
 limitations under the License.
 
-# Htmlextractor: Metadata extraction from HTML documents
-
-The **Htmlextractor Enhancement Engine** extracts embedded metadata from HTML documents, such as [Microformats](http://microformats.org/) and [RDFa](http://www.w3.org/TR/rdfa-syntax/).
-By providing other extractors it can be configured for any kind of content extraction from HTML pages.
-
-##Technical description
-
-### Supported metadata types
-
-The built-in extractors are defined in the default resource *htmlextractors.xml*. The following metadata types are supported:
-*    RDFa
-*    geo
-*    hAtom
-*    hCal
-*    hCard
-*    hReview
-*    rel-license
-*    rel-tag
-*    xFolk
-
-### Vocabularies
-
-#### HTML Microformat Extractors
-
-The following table describes which vocabularies are used for representing microformat data in Metaxa: 
-
-
-<table border="1">
-    <tr>
-        <th>MF</th>
-        <th>Vocabulary (Namespace)</th>
-    </tr>
-    <tr>
-        <td>geo</td>
-        <td>wgs84 (<tt>http://www.w3.org/2003/01/geo/wgs84_pos#</tt>)</td>
-    </tr>
-    <tr>
-        <td>hAtom</td>
-        <td>atom (<tt>http://www.w3.org/2005/Atom#)</td>
-    </tr>
-    <tr>
-    <td/>
-        <td>tagging (<tt>http://aperture.sourceforge.net/ontologies/tagging#</tt>)</td>
-    </tr>
-    <tr>
-        <td>hCal</td>
-        <td> ical (<tt>http://www.w3.org/2002/12/cal/icaltzd#</tt>)</td>
-    </tr>
-    <tr>
-        <td></td>
-        <td>vcard (<tt>http://www.w3.org/2006/vcard/ns#</tt>)</td>
-    </tr>
-    <tr>
-        <td>hCard</td>
-        <td>vcard (<tt>http://www.w3.org/2006/vcard/ns#</tt>)</td>
-    </tr>
-    <tr>
-        <td>hReview</td>
-        <td>review (<tt>http://www.purl.org/stuff/rev#</tt>)</td></tr>
-    <tr>
-        <td></td>
-        <td>wgs84 (<tt>http://www.w3.org/2003/01/geo/wgs84_pos#</tt>)</td>
-    </tr>
-    <tr>
-        <td></td>
-        <td>dc (<tt>http://purl.org/dc/elements/1.1/</tt>)</td>
-    </tr>
-    <tr>
-        <td></td>
-        <td>dcterms (<tt>http://purl.org/dc/dcmitype/</tt>)</td>
-    </tr>
-    <tr>
-        <td></td>
-        <td>foaf (<tt>http://xmlns.com/foaf/0.1/</tt>)</td>
-    </tr>
-    <tr>
-        <td></td>
-        <td>vcard (<tt>http://www.w3.org/2006/vcard/ns#</tt>)</td>
-    </tr>
-    <tr>
-        <td></td>
-        <td>tag (<tt>http://www.holygoat.co.uk/owl/redwood/0.1/tags/</tt>)</td>
-    </tr>
-    <tr>
-        <td>rel-license</td>
-        <td>dc (<tt>http://purl.org/dc/elements/1.1</tt>/)</td>
-    </tr>
-    <tr>
-        <td>rel-tag</td>
-        <td> tagging (<tt>http://aperture.sourceforge.net/ontologies/tagging#</tt>)</td>
-    </tr>
-    <tr>
-        <td>xFolk</td>
-        <td>nfo (<tt>http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#</tt>)</td>
-    </tr>
-    <tr>
-        <td></td>
-        <td>dc (<tt>http://purl.org/dc/elements/1.1</tt>/)</td>
-    </tr>
-    <tr>
-        <td></td>
-        <td>tagging (<tt>http://aperture.sourceforge.net/ontologies/tagging#</tt>)</td>
-    </tr>
-</table>
-
-To prevent the occurrence of unconnected graphs in the metadata extracted subgraphs get connected to the content item by the property:
-
-* http://www.semanticdesktop.org/ontologies/2007/01/19/nie#contains
-
-## Configuration options
-
-By default, the Htmlextractor engine uses the extractors specified in the resource "htmlregistry.xml".
-Alternative configurations and extractors can be attached to the Htmlextractor as fragment bundles, specifying as host bundle
-
-    Fragment-Host: org.apache.stanbol.enhancer.engines.htmlextractor
-
-The alternative configuration files then can be set as values of the property
-
-* <pre><code>org.apache.stanbol.enhancer.engines.htmlextractor.htmlextractors</pre></code>
-
-
-## Usage
-
-Assuming that the Stanbol endpoint with the full launcher is running at
-
-    http://localhost:8080
-
-and the engine is activated, from the command line commands like this can be used for submitting some file as content item, where the mime type must match the document type:
-
-* stateless interface
-
-    curl -i -X POST -H "Content-Type:text/html" -T testpage.html http://localhost:8080/enhancer
-
-* stateful interface
-
-    curl -i -X PUT -H "Content-Type:text/html" -T testpage.html http://localhost:8080/contenthub/content/someFileId
-
-Alternatively, the Stanbol web interface can be used for submitting documents
-and viewing the metadata at
-
-    http://localhost:8080/contenthub
+# Xmpextractor: Extracts XMP metadata from various file formats containing XMP
 

Modified: stanbol/trunk/enhancer/engines/xmpextractor/pom.xml
URL: http://svn.apache.org/viewvc/stanbol/trunk/enhancer/engines/xmpextractor/pom.xml?rev=1395309&r1=1395000&r2=1395309&view=diff
==============================================================================
--- stanbol/trunk/enhancer/engines/xmpextractor/pom.xml (original)
+++ stanbol/trunk/enhancer/engines/xmpextractor/pom.xml Sun Oct  7 14:13:12 2012
@@ -16,133 +16,103 @@
    limitations under the License.
 -->
 <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
-       xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
+                xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
        <modelVersion>4.0.0</modelVersion>
-  <parent>
-    <artifactId>org.apache.stanbol.enhancer.parent</artifactId>
-    <groupId>org.apache.stanbol</groupId>
-    <version>0.10.0-SNAPSHOT</version>
-    <relativePath>../../parent</relativePath>
-  </parent>
+       <parent>
+               <artifactId>org.apache.stanbol.enhancer.parent</artifactId>
+               <groupId>org.apache.stanbol</groupId>
+               <version>0.10.0-SNAPSHOT</version>
+               <relativePath>../../parent</relativePath>
+       </parent>
 
        <groupId>org.apache.stanbol</groupId>
-       <artifactId>org.apache.stanbol.enhancer.engines.htmlextractor</artifactId>
+       <artifactId>org.apache.stanbol.enhancer.engines.xmpextractor</artifactId>
        <version>0.10.0-SNAPSHOT</version>
        <packaging>bundle</packaging>
-       <name>Apache Stanbol Enhancer Enhancement Engine : HTML Extractors</name>
-  <description>Enhancement Engine that extracts RDFa and Microformat data from HTML pages
-  </description>
-
-  <inceptionYear>2012</inceptionYear>
-
-  <scm>
-    <connection>
-      scm:svn:http://svn.apache.org/repos/asf/stanbol/trunk/enhancer/engines/html/
-    </connection>
-    <developerConnection>
-      scm:svn:https://svn.apache.org/repos/asf/stanbol/trunk/enhancer/engines/html/
-    </developerConnection>
-    <url>http://stanbol.apache.org/</url>
-  </scm>
-  <build>
-    <plugins>
-      <plugin>
-        <groupId>org.apache.felix</groupId>
-        <artifactId>maven-bundle-plugin</artifactId>
-        <extensions>true</extensions>
-        <configuration>
-          <instructions>
-            <Export-Package>
-              org.apache.stanbol.enhancer.engines.html;version=${project.version}
-            </Export-Package>
-            <Embed-Dependency>
-               jtidy;scope=compile
-            </Embed-Dependency>
-            <Import-Package>
-               !org.apache.tools.*,
-               *
-            </Import-Package>
-          </instructions>
-        </configuration>
-      </plugin>
-      <plugin>
-        <groupId>org.apache.felix</groupId>
-        <artifactId>maven-scr-plugin</artifactId>
-      </plugin>
-      <plugin>
-        <groupId>org.apache.rat</groupId>
-        <artifactId>apache-rat-plugin</artifactId>
-        <configuration>
-          <excludes>
-            <!-- AL20 License  -->
-            <exclude>src/license/THIRD-PARTY.properties</exclude>
-            <!-- AL20 License for test resources (see src/test/resources/README) -->
-            <exclude>src/test/resources/*.html</exclude>
-          </excludes>
-        </configuration>
-      </plugin>
-    </plugins>
-  </build>
-
-  <properties>
-    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
-  </properties>
+       <name>Apache Stanbol Enhancer Enhancement Engine : XMP Extractors</name>
+       <description>Enhancement Engine that extracts XMP data
+       </description>
+
+       <inceptionYear>2012</inceptionYear>
+
+       <scm>
+               <connection>
+                       scm:svn:http://svn.apache.org/repos/asf/stanbol/trunk/enhancer/engines/xmp/
+               </connection>
+               <developerConnection>
+                       scm:svn:https://svn.apache.org/repos/asf/stanbol/trunk/enhancer/engines/xmp/
+               </developerConnection>
+               <url>http://stanbol.apache.org/</url>
+       </scm>
+       <build>
+               <plugins>
+                       <plugin>
+                               <groupId>org.apache.felix</groupId>
+                               <artifactId>maven-bundle-plugin</artifactId>
+                               <extensions>true</extensions>
+                       </plugin>
+                       <plugin>
+                               <groupId>org.apache.felix</groupId>
+                               <artifactId>maven-scr-plugin</artifactId>
+                       </plugin>
+               </plugins>
+       </build>
+
+       <properties>
+               <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
+       </properties>
 
        <dependencies>
-    <dependency>
-      <groupId>org.apache.stanbol</groupId>
-      <artifactId>org.apache.stanbol.enhancer.servicesapi</artifactId>
-      <version>0.10.0-SNAPSHOT</version>
-      <scope>provided</scope>
-    </dependency>
-    <dependency>
-      <groupId>org.apache.felix</groupId>
-      <artifactId>org.apache.felix.scr.annotations</artifactId>
-    </dependency>
-    <dependency>
-      <groupId>org.apache.clerezza</groupId>
-      <artifactId>rdf.core</artifactId>
-    </dependency>
-    <dependency>
-      <groupId>org.apache.clerezza</groupId>
-      <artifactId>rdf.jena.parser</artifactId>
-    </dependency>
+               <dependency>
+                       <groupId>org.apache.stanbol</groupId>
+                       <artifactId>org.apache.stanbol.enhancer.servicesapi</artifactId>
+                       <version>0.10.0-SNAPSHOT</version>
+                       <scope>provided</scope>
+               </dependency>
+               <dependency>
+                       <groupId>org.apache.felix</groupId>
+                       <artifactId>org.apache.felix.scr.annotations</artifactId>
+               </dependency>
                <dependency>
                        <groupId>org.slf4j</groupId>
                        <artifactId>slf4j-api</artifactId>
                </dependency>
+               <dependency>
+                       <groupId>org.apache.clerezza</groupId>
+                       <artifactId>rdf.ontologies</artifactId>
+               </dependency>
+       <!-- <dependency>
+               <groupId>org.apache.pdfbox</groupId>
+               <artifactId>jempbox</artifactId>
+               <version>1.7.1</version>
+       </dependency> -->
     <!-- Test dependencies -->
-    <dependency>
-      <groupId>org.apache.stanbol</groupId>
-      <artifactId>org.apache.stanbol.enhancer.test</artifactId>
-      <version>0.10.0-SNAPSHOT</version>
-      <scope>test</scope>
-    </dependency>
-    <dependency>
-      <groupId>org.apache.stanbol</groupId>
-      <artifactId>org.apache.stanbol.enhancer.core</artifactId>
-      <version>0.10.0-SNAPSHOT</version>
-      <scope>test</scope>
-    </dependency>
-    <dependency>
-      <groupId>org.slf4j</groupId>
-      <artifactId>slf4j-simple</artifactId>
-      <scope>test</scope>
-    </dependency>   
-    <dependency>
-      <groupId>junit</groupId>
-      <artifactId>junit</artifactId>
-      <scope>test</scope>
-    </dependency>
-
-    <dependency>
-       <groupId>com.ibm.icu</groupId>
-       <artifactId>icu4j</artifactId>
-    </dependency>
-    <dependency>
-       <groupId>net.sf.jtidy</groupId>
-       <artifactId>jtidy</artifactId>
-       <version>r938</version>
-    </dependency>
+               <dependency>
+                       <groupId>org.apache.stanbol</groupId>
+                       <artifactId>org.apache.stanbol.enhancer.test</artifactId>
+                       <version>0.10.0-SNAPSHOT</version>
+                       <scope>test</scope>
+               </dependency>
+               <dependency>
+                       <groupId>org.apache.stanbol</groupId>
+                       <artifactId>org.apache.stanbol.enhancer.core</artifactId>
+                       <version>0.10.0-SNAPSHOT</version>
+                       <scope>test</scope>
+               </dependency>
+               <dependency>
+                       <groupId>org.slf4j</groupId>
+                       <artifactId>slf4j-simple</artifactId>
+                       <scope>test</scope>
+               </dependency>   
+               <dependency>
+                       <groupId>junit</groupId>
+                       <artifactId>junit</artifactId>
+                       <scope>test</scope>
+               </dependency>
+               <dependency>
+                       <groupId>org.apache.tika</groupId>
+                       <artifactId>tika-parsers</artifactId>
+                       <type>jar</type>
+               </dependency>
        </dependencies>
 </project>

Copied: stanbol/trunk/enhancer/engines/xmpextractor/src/main/java/org/apache/stanbol/enhancer/engines/xmpextractor/XmpExtractorEngine.java (from r1395000, stanbol/trunk/enhancer/engines/htmlextractor/src/main/java/org/apache/stanbol/enhancer/engines/htmlextractor/HtmlExtractorEngine.java)
URL: http://svn.apache.org/viewvc/stanbol/trunk/enhancer/engines/xmpextractor/src/main/java/org/apache/stanbol/enhancer/engines/xmpextractor/XmpExtractorEngine.java?p2=stanbol/trunk/enhancer/engines/xmpextractor/src/main/java/org/apache/stanbol/enhancer/engines/xmpextractor/XmpExtractorEngine.java&p1=stanbol/trunk/enhancer/engines/htmlextractor/src/main/java/org/apache/stanbol/enhancer/engines/htmlextractor/HtmlExtractorEngine.java&r1=1395000&r2=1395309&rev=1395309&view=diff
==============================================================================
--- stanbol/trunk/enhancer/engines/htmlextractor/src/main/java/org/apache/stanbol/enhancer/engines/htmlextractor/HtmlExtractorEngine.java (original)
+++ stanbol/trunk/enhancer/engines/xmpextractor/src/main/java/org/apache/stanbol/enhancer/engines/xmpextractor/XmpExtractorEngine.java Sun Oct  7 14:13:12 2012
@@ -14,125 +14,50 @@
  * See the License for the specific language governing permissions and
  * limitations under the License.
  */
-package org.apache.stanbol.enhancer.engines.htmlextractor;
+package org.apache.stanbol.enhancer.engines.xmpextractor;
 
+import java.io.ByteArrayInputStream;
+import java.io.ByteArrayOutputStream;
 import java.io.IOException;
-import java.nio.charset.Charset;
-import java.util.Arrays;
+import java.io.InputStream;
 import java.util.Collections;
-import java.util.Dictionary;
-import java.util.HashSet;
 import java.util.Map;
-import java.util.Set;
 
-import org.apache.clerezza.rdf.core.MGraph;
-import org.apache.clerezza.rdf.core.UriRef;
-import org.apache.clerezza.rdf.core.impl.SimpleMGraph;
+import org.apache.clerezza.rdf.core.Graph;
+import org.apache.clerezza.rdf.core.serializedform.Parser;
 import org.apache.felix.scr.annotations.Component;
 import org.apache.felix.scr.annotations.Property;
 import org.apache.felix.scr.annotations.Reference;
 import org.apache.felix.scr.annotations.Service;
-import org.apache.stanbol.enhancer.engines.htmlextractor.impl.BundleURIResolver;
-import org.apache.stanbol.enhancer.engines.htmlextractor.impl.ClerezzaRDFUtils;
-import org.apache.stanbol.enhancer.engines.htmlextractor.impl.ExtractorException;
-import org.apache.stanbol.enhancer.engines.htmlextractor.impl.HtmlExtractionRegistry;
-import org.apache.stanbol.enhancer.engines.htmlextractor.impl.HtmlExtractor;
-import org.apache.stanbol.enhancer.engines.htmlextractor.impl.HtmlParser;
-import org.apache.stanbol.enhancer.engines.htmlextractor.impl.InitializationException;
-import org.apache.stanbol.enhancer.servicesapi.Blob;
 import org.apache.stanbol.enhancer.servicesapi.ContentItem;
-import org.apache.stanbol.enhancer.servicesapi.ContentItemFactory;
 import org.apache.stanbol.enhancer.servicesapi.EngineException;
 import org.apache.stanbol.enhancer.servicesapi.EnhancementEngine;
 import org.apache.stanbol.enhancer.servicesapi.ServiceProperties;
 import org.apache.stanbol.enhancer.servicesapi.impl.AbstractEnhancementEngine;
-import org.apache.stanbol.enhancer.servicesapi.rdf.NamespaceEnum;
-import org.osgi.framework.BundleContext;
-import org.osgi.service.cm.ConfigurationException;
-import org.osgi.service.component.ComponentContext;
+import org.apache.tika.parser.image.xmp.XMPPacketScanner;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
 
-/**
- *
- * @author <a href="mailto:[email protected]">Walter Kasper</a>
- * 
- */
+
 @Component(immediate = true, metatype = true, inherit = true)
 @Service
 @org.apache.felix.scr.annotations.Properties(value={
-    @Property(name=EnhancementEngine.PROPERTY_NAME, value="htmlextractor")
+    @Property(name=EnhancementEngine.PROPERTY_NAME, value="xmpextractor")
 })
-public class HtmlExtractorEngine extends AbstractEnhancementEngine<IOException,RuntimeException>
+public class XmpExtractorEngine extends AbstractEnhancementEngine<IOException,RuntimeException>
         implements EnhancementEngine, ServiceProperties {
-    private static final Logger LOG = LoggerFactory.getLogger(HtmlExtractorEngine.class);
+    private static final Logger LOG = LoggerFactory.getLogger(XmpExtractorEngine.class);
     
-    /**
-     * The default charset
-     */
-    private static final Charset UTF8 = Charset.forName("UTF-8");
-
+    
+    @Reference
+    Parser parser;
+  
     /**
      * The default value for the Execution of this Engine. Currently set to
      * {@link ServiceProperties#ORDERING_PRE_PROCESSING}
      */
     public static final Integer defaultOrder = ORDERING_PRE_PROCESSING;
 
-    private static final String DEFAULT_HTML_EXTRACTOR_REGISTRY = "htmlextractors.xml";
-
-    /**
-     * name of a file that defines the set of extractors for HTML documents. By default, the builtin file 'htmlextractors.xml' is used."
-     */
-    @Property(value=HtmlExtractorEngine.DEFAULT_HTML_EXTRACTOR_REGISTRY)
-    public static final String HTML_EXTRACTOR_REGISTRY = "org.apache.stanbol.enhancer.engines.htmlextractor.htmlextractors";
-
-    /**
-     * Internally used to create additional {@link Blob} for transformed
-     * versions af the original content
-     */
-    @Reference
-    private ContentItemFactory ciFactory;
-    
-    BundleContext bundleContext;
-
-    private Set<String> supportedMimeTypes = new HashSet<String>(Arrays.asList(new String[]{"text/html","application/xhtml+xml"}));
-    
-    private HtmlExtractionRegistry htmlExtractorRegistry;
-    private HtmlParser htmlParser;
-    
-    private boolean singleRootRdf = true;
-    
-    protected void activate(ComponentContext ce) throws ConfigurationException, IOException  {
-        super.activate(ce);
-        this.bundleContext = ce.getBundleContext();
-        BundleURIResolver.BUNDLE = this.bundleContext.getBundle();
-        String htmlExtractors = DEFAULT_HTML_EXTRACTOR_REGISTRY;
-        Dictionary<String, Object> properties = ce.getProperties();
-        String confFile = (String)properties.get(HTML_EXTRACTOR_REGISTRY);
-        if (confFile != null && confFile.trim().length() > 0) {
-            htmlExtractors = confFile;
-        }
-        try {
-            this.htmlExtractorRegistry = new HtmlExtractionRegistry(htmlExtractors);
-        }
-        catch (InitializationException e) {
-            LOG.error("Registry Initialization Error: " + e.getMessage());
-            throw new IOException(e.getMessage());
-        }
-        this.htmlParser = new HtmlParser();
-
-    }
-
-    /**
-     * The deactivate method.
-     *
-     * @param ce the {@link ComponentContext}
-     */
-    protected void deactivate(ComponentContext ce) {
-        super.deactivate(ce);
-        this.htmlParser = null;
-        this.htmlExtractorRegistry = null;
-    }
 
     @Override
     public Map<String,Object> getServiceProperties() {
@@ -150,35 +75,33 @@ public class HtmlExtractorEngine extends
     
     @Override
     public void computeEnhancements(ContentItem ci) throws EngineException {
-        HtmlExtractor extractor = new HtmlExtractor(htmlExtractorRegistry, htmlParser);
-        MGraph model = new SimpleMGraph();
-        ci.getLock().readLock().lock();
-        try {
-            extractor.extract(ci.getUri().getUnicodeString(), ci.getStream(),null, ci.getMimeType(), model);
-        } catch (ExtractorException e) {
-            throw new EngineException("Error while processing ContentItem "
-                    + ci.getUri()+" with HtmlExtractor",e);
-        } finally {
-            ci.getLock().readLock().unlock();
-        }
-        ClerezzaRDFUtils.urifyBlankNodes(model);
-        // make the model single rooted
-        if (singleRootRdf) {
-            ClerezzaRDFUtils.makeConnected(model,ci.getUri(),new UriRef(NamespaceEnum.nie+"contains"));
-        }
-        //add the extracted triples to the metadata of the ContentItem
-        ci.getLock().writeLock().lock();
-        try { 
-            LOG.info("Model: {}",model);
-            ci.getMetadata().addAll(model);
-            model = null;
-        } finally {
-            ci.getLock().writeLock().unlock();
-        }
+       InputStream in = ci.getBlob().getStream();
+       XMPPacketScanner scanner = new XMPPacketScanner();
+       ByteArrayOutputStream baos = new ByteArrayOutputStream();
+       try {
+                       scanner.parse(in, baos);
+               } catch (IOException e) {
+                       throw new EngineException(e);
+               }
+       byte[] bytes = baos.toByteArray();
+       if (bytes.length > 0) {
+               Graph model = parser.parse(new ByteArrayInputStream(bytes), "application/rdf+xml");
+               ci.getLock().writeLock().lock();
+               try { 
+                   LOG.info("Model: {}",model);
+                   ci.getMetadata().addAll(model);
+               } finally {
+                   ci.getLock().writeLock().unlock();
+               }
+       }
     }
     
     private boolean isSupported(String mimeType) {
-        return this.supportedMimeTypes.contains(mimeType);
+       if (mimeType.startsWith("text/")) {
+               return false; //assuming text types cannot contain XMP
+       } else {
+               return true; // As there isn't a list of media types that can contain XMP
+       }
     }
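
For readers unfamiliar with XMP, the scanning step that computeEnhancements delegates to Tika's XMPPacketScanner can be pictured roughly as follows. This is a minimal sketch and not Tika's implementation: the class XmpPacketSketch and its methods are hypothetical names introduced here, and it assumes the packet is stored in an ASCII-compatible encoding (such as UTF-8) and delimited by the `<?xpacket begin=` and `<?xpacket end=` processing instructions defined by the XMP packet wrapper.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;

/** Hypothetical helper for illustration only; not part of Stanbol or Tika. */
final class XmpPacketSketch {

    private static final byte[] PACKET_BEGIN = ascii("<?xpacket begin=");
    private static final byte[] PACKET_END = ascii("<?xpacket end=");
    private static final byte[] PI_CLOSE = ascii("?>");

    private static byte[] ascii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    /** Returns the first index of needle in haystack at or after from, or -1. */
    static int indexOf(byte[] haystack, byte[] needle, int from) {
        outer:
        for (int i = Math.max(from, 0); i <= haystack.length - needle.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (haystack[i + j] != needle[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    /**
     * Returns the bytes between the xpacket begin and end processing
     * instructions, decoded as UTF-8 (an assumption of this sketch),
     * or an empty Optional if no complete packet is found.
     */
    static Optional<String> extractPacket(byte[] content) {
        int begin = indexOf(content, PACKET_BEGIN, 0);
        if (begin < 0) {
            return Optional.empty();
        }
        // Skip to the end of the opening processing instruction.
        int piClose = indexOf(content, PI_CLOSE, begin);
        if (piClose < 0) {
            return Optional.empty();
        }
        int payloadStart = piClose + PI_CLOSE.length;
        int end = indexOf(content, PACKET_END, payloadStart);
        if (end < 0) {
            return Optional.empty();
        }
        return Optional.of(
                new String(content, payloadStart, end - payloadStart, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        byte[] sample = ("binary-prefix<?xpacket begin=\"\" id=\"W5M0MpCehiHzreSzNTczkc9d\"?>"
                + "<rdf:RDF/><?xpacket end=\"w\"?>binary-suffix").getBytes(StandardCharsets.UTF_8);
        System.out.println(extractPacket(sample).orElse("(no XMP packet found)"));
    }
}
```

Scanning raw bytes rather than parsing each container format fits the commit's isSupported, which accepts every non-text media type because no list of XMP-capable types is available.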

svn commit: r1395309 - in /stanbol/trunk/enhancer: bundlelist/src/main/bundles/ engines/ engines/xmpextractor/ engines/xmpextractor/src/main/java/org/apache/stanbol/enhancer/engines/htmlextractor/ engines/xmpextractor/src/main/java/org/apache/stanbol/e...
