Thanks for the fast reply.
I don't think the "Date" field is what I am looking for.
I read it in MoreIndexingFilter.java and do get a date, but it is the
fetch date, not the value of the http-equiv="Last-Modified" meta tag in
the HTML files.
So my question is: why does the following not work?
  String lastModified = data.getMeta(Metadata.LAST_MODIFIED);
  if (lastModified != null) {            // try to parse Last-Modified
    time = getTime(lastModified, url);   // use as time
    // store as string
    doc.add(new Field("lastModified", Long.toString(time),
                      Field.Store.YES, Field.Index.NO));
  }
I do not understand why lastModified is null because, as mentioned in my
other thread, the Last-Modified tag is parsed correctly!
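
To illustrate the distinction between an HTTP response header and an
http-equiv meta tag inside the page, here is a minimal standalone Java
sketch. It does not use any Nutch API and says nothing about where Nutch
stores the tag. The class name LastModifiedMeta and both helper methods are
hypothetical. The regular expression assumes http-equiv comes before content
inside the tag, and the value is assumed to be an RFC 1123 date such as the
one in the output quoted below.

```java
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch.
class LastModifiedMeta {

    // Assumption: the tag is written as
    //   <meta http-equiv="Last-Modified" content="...">
    // with http-equiv before content. Matching is case-insensitive.
    private static final Pattern META = Pattern.compile(
        "<meta\\s+http-equiv\\s*=\\s*[\"']last-modified[\"']"
            + "\\s+content\\s*=\\s*[\"']([^\"']*)[\"']",
        Pattern.CASE_INSENSITIVE);

    // Returns the content attribute of the Last-Modified meta tag, if any.
    static Optional<String> extract(String html) {
        Matcher m = META.matcher(html);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    // Parses an RFC 1123 date and returns milliseconds since the epoch.
    static long toEpochMillis(String httpDate) {
        return ZonedDateTime
            .parse(httpDate, DateTimeFormatter.RFC_1123_DATE_TIME)
            .toInstant()
            .toEpochMilli();
    }

    public static void main(String[] args) {
        String html = "<html><head>"
            + "<meta http-equiv=\"Last-Modified\""
            + " content=\"Tue, 25 Sep 2007 15:53:40 GMT\">"
            + "</head><body></body></html>";
        // Prints: Tue, 25 Sep 2007 15:53:40 GMT -> 1190735620000
        extract(html).ifPresent(
            v -> System.out.println(v + " -> " + toEpochMillis(v)));
    }
}
```

A regular expression is used only to keep the sketch short. A real HTML
parser would be more robust against attribute reordering and unusual
quoting.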
Regards,
Sebastian
Susam Pal wrote:
>
> I tried this code:-
>
> System.out.println(metaData);
> String[] names = metaData.names();
> for (int i = 0; i < names.length; i++) {
>   System.out.println(names[i] + ": " + metaData.get(names[i]));
> }
>
> I got this:-
>
> nutch.content.digest=96f6d3d267d955728fc98b820fc72c32 Date=Tue, 25 Sep
> 2007 15:53:40 GMT Content-Length=73 nutch.crawl.score=1.0
> nutch.segment.name=20070925212336 Content-Type=text/html;
> charset=UTF-8 Server=Apache/2.2.3 (Debian) PHP/5.2.0-8+etch7
> X-Powered-By=PHP/5.2.0-8+etch7 _ftk_=1190735620303
> nutch.content.digest: 96f6d3d267d955728fc98b820fc72c32
> Date: Tue, 25 Sep 2007 15:53:40 GMT
> Content-Length: 73
> nutch.crawl.score: 1.0
> nutch.segment.name: 20070925212336
> Content-Type: text/html; charset=UTF-8
> Server: Apache/2.2.3 (Debian) PHP/5.2.0-8+etch7
> X-Powered-By: PHP/5.2.0-8+etch7
> _ftk_: 1190735620303
>
> So, metaData.get("Date") is one good solution.
>
> I wonder why the date is stored against "Date" whereas DublinCore
> interface (which Metadata implements) defines DATE as:-
>
> public static final String DATE = "date";
>
> Regards,
> Susam Pal
> http://susam.in/
>
> On 9/25/07, Sebastian Schick <[EMAIL PROTECTED]> wrote:
>>
>> Hello,
>>
>> we have the same problem. I accidentally created a new thread here:
>> http://www.nabble.com/problem-with-MoreIndexingFilter-tf4515835.html#a12880357
>> Are there already any solutions?
>>
>> Regards,
>>
>> Sebastian
>>
>>
>> chris sleeman wrote:
>> >
>> > Hi,
>> >
>> > Can anyone tell me how to get the last-modified or the creation time of
>> > a page, crawled and indexed by nutch?
>> > I tried using the Metadata.LAST_MODIFIED field but it returned me null.
>> > I need them while displaying my search results.
>> >
>> > Would appreciate any pointers on this.
>> >
>> > Regards,
>> > Chris
>> >
>> >
>>
>> --
>> View this message in context:
>> http://www.nabble.com/Last-modified---creation-date-or-time-tf3704140.html#a12881175
>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>
>>
>
>