Thanks Enis,
I am getting some idea from that.
Can you tell me in which class I should implement that?
I don't have Hadoop installed on my box.
Awaiting your reply,
Cha
Enis Soztutar wrote:
>
> the crawldb is a serialization of Hadoop's
> org.apache.hadoop.io.MapFile structure. A MapFile contains two
> SequenceFiles, one for data and one for the index. This is an excerpt
> from the javadoc of the MapFile class:
>
> * A file-based map from keys to values.
> *
> * <p>A map is a directory containing two files, the <code>data</code> file,
> * containing all keys and values in the map, and a smaller <code>index</code>
> * file, containing a fraction of the keys. The fraction is determined by
> * {@link Writer#getIndexInterval()}.
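>
> In Nutch terms, a crawldb on disk therefore typically looks something
> like this (part-00000 is just the usual default part name):
>
>     crawl/crawldb/current/part-00000/
>         data     (SequenceFile holding all the <url, CrawlDatum> entries)
>         index    (smaller SequenceFile holding a fraction of the keys)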
>
> The MapFile.Reader class is for reading the contents of a map file. By
> using this class you can enumerate all the entries of the map file, and
> since the keys of the crawldb are Text objects containing URLs, you can
> just dump the keys one by one to another file. Try the following:
>
>
> import java.io.PrintStream;
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.io.MapFile;
> import org.apache.hadoop.io.Writable;
> import org.apache.hadoop.io.WritableComparable;
>
> Configuration conf = new Configuration();
> FileSystem fs = FileSystem.get(conf);
> PrintStream out = System.out;
>
> // point this at one part directory of the crawldb, e.g.:
> String seqFile = "crawl/crawldb/current/part-00000";
>
> MapFile.Reader reader = new MapFile.Reader(fs, seqFile, conf);
>
> Class keyC = reader.getKeyClass();
> Class valueC = reader.getValueClass();
>
> while (true) {
>   WritableComparable key = null;
>   Writable value = null;
>   try {
>     // instantiate a fresh key (Text) and value (CrawlDatum) to read into
>     key = (WritableComparable) keyC.newInstance();
>     value = (Writable) valueC.newInstance();
>   } catch (Exception ex) {
>     ex.printStackTrace();
>     System.exit(-1);
>   }
>
>   try {
>     if (!reader.next(key, value)) {
>       break; // reached the end of the map file
>     }
>
>     out.println(key);
>     out.println(value);
>   } catch (Exception e) {
>     e.printStackTrace();
>     out.println("Exception occurred. " + e);
>     break;
>   }
> }
>
> reader.close();
>
> This code is just for demonstration; of course you can customize it for
> your needs, for example to print in XML format. You can check the
> javadocs of the CrawlDatum, CrawlDb, Text, MapFile and SequenceFile
> classes for further insight.
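>
> For instance, here is a minimal sketch of XML output: replace the two
> println calls inside the loop above with something like the following
> (the element names are just an illustration, and a real XML writer
> should also escape characters such as & and <):
>
> out.println("<record>");
> out.println("  <url>" + key + "</url>");
> out.println("  <datum><![CDATA[" + value + "]]></datum>");
> out.println("</record>");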
>
>
> cha wrote:
>> Hi Enis,
>>
>> I still can't figure out how it can be done. Can you explain it in
>> more detail, please?
>>
>> Regards,
>> Chandresh
>>
>> Enis Soztutar wrote:
>>
>>> cha wrote:
>>>
>>>> Hi Sagar,
>>>>
>>>> Thanks for the reply.
>>>>
>>>> Actually, I am trying to dig into the code in the same class, but I
>>>> am not able to figure out where the URLs are read from.
>>>>
>>>> When you dump the database, the file contains:
>>>>
>>>> http://blog.cha.com/ Version: 4
>>>> Status: 2 (DB_fetched)
>>>> Fetch time: Fri Apr 13 15:58:28 IST 2007
>>>> Modified time: Thu Jan 01 05:30:00 IST 1970
>>>> Retries since fetch: 0
>>>> Retry interval: 30.0 days
>>>> Score: 0.062367838
>>>> Signature: 2b4e94ff83b8a4aa6ed061f607683d2e
>>>> Metadata: null
>>>>
>>>> I figured out the rest of the fields, but I am not sure how the URL
>>>> name is read.
>>>>
>>>> I just want the plain URLs in the text file. Is it possible to write
>>>> the URLs in some XML format? If yes, then how?
>>>>
>>>> Awaiting your reply,
>>>>
>>>> Chandresh
>>>>
>>>>
>>>>
>>> Hi, the crawldb is actually a map file, which has URLs as keys (Text
>>> class) and CrawlDatum objects as values. You can write a generic map
>>> file reader which extracts the keys and dumps them to a file.