Hello Marco,

Here is an example of an extractor configuration:

<extractor classname="nl.hippo.slide.extractor.UrlListXMLPropertyExtractor"
           uri="/files/default.preview"
           content-type="text/xml | text/xml; charset=UTF-8 | application/xml">
  <configuration>
    <instruction property="links" namespace="http://hippo.nl/cms/1.0"
                 xpath="//@href|//@src|//datasource/text()|//bannerUrl/text()|//logoUrl/text()"/>
  </configuration>
</extractor>

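As you can see, the xpath attribute joins several XPath expressions with the
union operator (|), so values from different parts of the XML all end up in
the "links" property. I'm not sure whether the XPath engine also supports
predicates such as //@href[starts-with(.,'/content/')] to select internal
links only.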
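For the document in your message (quoted below), the same idea might look
roughly like the following. This is an untested sketch: the classname, uri and
content-type are simply copied from the example above and may need adjusting
for your repository, and the attribute names (href, url, path) are taken from
your sample document.

```xml
<extractor classname="nl.hippo.slide.extractor.UrlListXMLPropertyExtractor"
           uri="/files/default.preview"
           content-type="text/xml | text/xml; charset=UTF-8 | application/xml">
  <configuration>
    <!-- Assumption: one union XPath covering the HTML links, external links,
         related documents, images and assets of the sample document -->
    <instruction property="links" namespace="http://hippo.nl/cms/1.0"
                 xpath="//a/@href|//externalLink/@url|//relatedDoc/@path|//image/@path|//asset/@path"/>
  </configuration>
</extractor>
```

Each part of the union selects attribute nodes rather than text() nodes,
because your document stores its links in attributes.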
Hope this helps you,

Jasha Joachimsthal


www.onehippo.com
Amsterdam - Hippo B.V. Oosteinde 11 1017 WT Amsterdam +31(0)20-5224466
San Francisco - Hippo USA Inc. 185 H Street, suite B, Petaluma CA 94952 +1
(707) 7734646



2009/7/3 Marco Casavecchia Morganti wrote:

> Hello all,
>
> I would like to set up the broken link checker for my CMS, but before I
> start, I need to know whether I have understood how it works.
> As far as I know, the checker creates an XML file in the repository that
> serves as the "database" of the inspected links.
> To create this document, it has to browse the repository in search of a
> webdavProperty called "links".
>
> So, if this is right, I need to configure an extractor in the repository.
>
> Now, I have a document like this:
> ------------------------------------
> <?xml version="1.0" encoding="UTF-8"?>
> <document>
>  <metaCurSection>multimedia</metaCurSection>
>  <taxonomies> </taxonomies>
>  <primaryData lang="it">
>    <content >
>      <html>
>        <body>
>          <a href="http://www.google.com" title="prova">testlink</a>
>        </body>
>      </html>
>    </content>
>    <shortDescription />
>    <title>Test di Impaginazione Template</title>
>  </primaryData>
>  <attachments lang="it">
>    <externalLinks>
>      <externalLink label="Prova" order="1" url="http://www.google.com/"/>
>    </externalLinks>
>    <assets>
>      <asset order="1" path="/binaries/sandbox/urb_part.gif"/>
>    </assets>
>    <images>
>      <image alt="Prova Formattazione" order="1" path="/binaries/sandbox/nx03_wallpaper01.jpg"/>
>    </images>
>    <relatedDocs>
>      <relatedDoc order="1" path="/content/taxonomies/ankonline/uffici/stampa/conferenze/2007/nuovodoc.xml"/>
>    </relatedDocs>
>  </attachments>
>  <multimedia lang="it">
>    <stream externalPath="/video/test.flv" repository="external"/>
>  </multimedia>
>  <secondaryData lang="it">
>    <tickets />
>    <other />
>  </secondaryData>
>  <contacts lang="it">
>    <info />
>    <timeTable />
>    <telephones>
>      <telephone number="112324345" order="1"/>
>    </telephones>
>    <faxes>
>      <fax number="12121341" order="2"/>
>    </faxes>
>    <emails>
>      <email address="[email protected]" order="3"/>
>    </emails>
>  </contacts>
> </document>
>
> I have to extract:
> - The links in the HTML fields, such as "/document/primaryData/content"
> - The external links in "/document/attachments/externalLinks/externalLink"
> - The internal links in "/document/attachments/relatedDocs/relatedDoc"
> - The images in "/document/attachments/images/image"
> - The assets in "/document/attachments/assets/asset"
>
> Can someone show me an example of an extractor configuration?
> Thanks in advance.
>
> --
> By MCM.
>
> << Theory is when you know everything but nothing works.
> Practice is when everything works but nobody knows why.
> In any case, one ends up combining theory with practice: nothing
> works and nobody knows why. >>
> (A. Einstein)
> ********************************************
> Hippocms-dev: Hippo CMS development public mailinglist
>
> Searchable archives can be found at:
> MarkMail: http://hippocms-dev.markmail.org
> Nabble: http://www.nabble.com/Hippo-CMS-f26633.html