Thanks Jasha,
I understand. So I don't need to change anything in dasl-indexer.xml.

I have another question: is there a way to force a "reindex" of the repository, so that these properties are created for my existing documents?

Jasha Joachimsthal wrote:
Hello Marco,
An example is:

<extractor classname="nl.hippo.slide.extractor.UrlListXMLPropertyExtractor"
           uri="/files/default.preview"
           content-type="text/xml | text/xml; charset=UTF-8 | application/xml">
 <configuration>
  <instruction property="links" namespace="http://hippo.nl/cms/1.0"
               xpath="//@href|//@src|//datasource/text()|//bannerUrl/text()|//logoUrl/text()"/>
 </configuration>
</extractor>

As you can see, the xpath attribute joins several expressions with the
union operator (|), so links are collected from several parts of the XML.
I'm not sure whether the XPath engine also supports a predicate such as
//@href[starts-with(.,'/content/')] to keep only internal links.
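
For the sample document quoted below, the same pattern might be adapted
along the following lines. This is an untested sketch: the class name and
namespace are copied from the example above, the uri and content-type
values are placeholders that have to match your own repository layout,
and the XPath expressions are assumptions derived from the element and
attribute names in that sample document.

```xml
<!-- Untested sketch: uri and content-type are copied from the example above
     and must be adapted to the location of your own documents. -->
<extractor classname="nl.hippo.slide.extractor.UrlListXMLPropertyExtractor"
           uri="/files/default.preview"
           content-type="text/xml | text/xml; charset=UTF-8 | application/xml">
  <configuration>
    <!-- One union expression: links and image sources inside the HTML
         content field, then the url/path attributes of the attachments. -->
    <instruction property="links" namespace="http://hippo.nl/cms/1.0"
                 xpath="//content//@href|//content//@src|//externalLink/@url|//relatedDoc/@path|//image/@path|//asset/@path"/>
  </configuration>
</extractor>
```

If the engine turns out to accept predicates, a branch such as
//relatedDoc/@path[starts-with(.,'/content/')] would restrict that part
of the union to internal links; whether it does is not confirmed here.
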
Hope this helps you,

Jasha Joachimsthal

[email protected] - [email protected]

www.onehippo.com
Amsterdam - Hippo B.V. Oosteinde 11 1017 WT Amsterdam +31(0)20-5224466
San Francisco - Hippo USA Inc. 185 H Street, suite B, Petaluma CA 94952 +1
(707) 7734646



2009/7/3 Marco Casavecchia Morganti <[email protected]>

Hello all,

I would like to set up the broken link checker for my CMS, but before I
start I need to check that I understand how it works.
As far as I know, the checker creates an XML file in the repository that
serves as the "database" of the inspected links.
To create this document, it browses the repository looking for a
WebDAV property called "links".

So, if this is right, I need to configure an extractor in the repository.

Now, I have a document like this:
------------------------------------
<?xml version="1.0" encoding="UTF-8"?>
<document>
 <metaCurSection>multimedia</metaCurSection>
 <taxonomies> </taxonomies>
 <primaryData lang="it">
   <content >
     <html>
       <body>
         <a href="http://www.google.com" title="prova">testlink</a>
       </body>
     </html>
   </content>
   <shortDescription />
   <title>Test di Impaginazione Template</title>
 </primaryData>
 <attachments lang="it">
   <externalLinks>
     <externalLink label="Prova" order="1" url="http://www.google.com/"/>
   </externalLinks>
   <assets>
     <asset order="1" path="/binaries/sandbox/urb_part.gif"/>
   </assets>
   <images>
     <image alt="Prova Formattazione" order="1"
path="/binaries/sandbox/nx03_wallpaper01.jpg"/>
   </images>
   <relatedDocs>
     <relatedDoc order="1"
path="/content/taxonomies/ankonline/uffici/stampa/conferenze/2007/nuovodoc.xml"/>
   </relatedDocs>
 </attachments>
 <multimedia lang="it">
   <stream externalPath="/video/test.flv" repository="external"/>
 </multimedia>
 <secondaryData lang="it">
   <tickets />
   <other />
 </secondaryData>
 <contacts lang="it">
   <info />
   <timeTable />
   <telephones>
     <telephone number="112324345" order="1"/>
   </telephones>
   <faxes>
     <fax number="12121341" order="2"/>
   </faxes>
   <emails>
     <email address="[email protected]" order="3"/>
   </emails>
 </contacts>
</document>

I have to extract:
- The links in HTML fields such as "/document/primaryData/content"
- The external links in "/document/attachments/externalLinks/externalLink"
- The internal links in "/document/attachments/relatedDocs/relatedDoc"
- The images in "/document/attachments/images/image"
- The assets in "/document/attachments/assets/asset"

Can someone show me an example of an extractor configuration?
Thanks in advance.

--
By MCM.

<< Theory is when you know everything but nothing works.
Practice is when everything works but nobody knows why.
In any case, one ends up combining theory with practice: nothing
works and nobody knows why. >>
(A. Einstein)
********************************************
Hippocms-dev: Hippo CMS development public mailinglist

Searchable archives can be found at:
MarkMail: http://hippocms-dev.markmail.org
Nabble: http://www.nabble.com/Hippo-CMS-f26633.html

