It's writing the segments to a new directory, then I believe merging them and the index... or am I reading the script wrong?
Stefan Neufeind wrote:

>Oh sorry, I didn't look up the script again from your earlier mail. Hmm,
>I guess you can live fine without the invertlinks (if I'm right). Are
>you sure that your indexing works fine? I think if an index exists nutch
>complains. See if there is any error with indexing. Also maybe try to
>delete your current index before indexing again.
>
>Still doesn't work?
>
>Regards,
> Stefan
>
>Matthew Holt wrote:
>
>>Sorry to be asking so many questions.. Below is the current script I'm
>>using. It's indexing the segments.. so do I use invertlinks directly
>>after the fetch? I'm kind of confused.. thanks.
>>matt
>>
>
>[...]
>
>>---------------------------------------------------------------
>>
>>Stefan Neufeind wrote:
>>
>>>You miss actually indexing the pages :-) This is done inside the
>>>"crawl"-command which does everything in one. After you fetched
>>>everything use:
>>>
>>>nutch invertlinks ...
>>>nutch index ...
>>>
>>>Hope that helps. Otherwise let me know and I'll dig out the complete
>>>commandlines for you.
>>>
>>>Regards,
>>>Stefan
>>>
>>>Matthew Holt wrote:
>>>
>>>>Just FYI.. After I do the recrawl, I do stop and start tomcat, and still
>>>>the newly created page can not be found.
>>>>
>>>>Matthew Holt wrote:
>>>>
>>>>>The recrawl worked this time, and I recrawled the entire db using the
>>>>>-adddays argument (in my case ./recrawl crawl 10 31). However, it
>>>>>didn't find a newly created page.
>>>>>
>>>>>If I delete the database and do the initial crawl over again, the new
>>>>>page is found. Any idea what I'm doing wrong or why it isn't finding
>>>>>it?
>>>>>
>>>>>Thanks!
>>>>>Matt
>>>>>
>>>>>Matthew Holt wrote:
>>>>>
>>>>>>Stefan,
>>>>>>Thanks a bunch! I see what you mean..
>>>>>>matt
>>>>>>
>>>>>>Stefan Neufeind wrote:
>>>>>>
>>>>>>>Matthew Holt wrote:
>>>>>>>
>>>>>>>>Hi all,
>>>>>>>>I have already successfuly indexed all the files on my domain only
>>>>>>>>(as specified in the conf/crawl-urlfilter.txt file).
>>>>>>>>
>>>>>>>>Now when I use the below script (./recrawl crawl 10 31) to recrawl the
>>>>>>>>domain, it begins indexing pages off of my domain (such as wikipedia,
>>>>>>>>etc). How do I prevent this? Thanks!
>>>>>>>>
>>>>>>>Hi Matt,
>>>>>>>
>>>>>>>have a look at regex-urlfilter. "crawl" is special in some ways.
>>>>>>>Actually it's "shortcut" for several steps. And it has a special
>>>>>>>urlfilter-file. But if you do it in several steps that urlfilter-file
>>>>>>>is no longer used.
>>>>>>>

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
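
On the urlfilter point quoted above (the step-by-step commands not reading conf/crawl-urlfilter.txt), here is a rough Python sketch of how I understand the +/- rules in a regex-urlfilter file to behave: rules are tried in order and the first one that matches decides. This is only a simulation for checking your own patterns, not Nutch code. The `accepts` helper and the example.com domain are made up for illustration, and treating a URL that matches no rule as skipped is my assumption.

```python
import re

# Hypothetical rules in the "+regex" / "-regex" style of a regex-urlfilter file.
# example.com is a placeholder for your own domain.
RULES = [
    r"+^http://([a-z0-9-]+\.)*example\.com/",  # accept anything on the domain
    r"-.",                                      # reject everything else
]

def accepts(url, rules=RULES):
    """Return True if the first rule whose regex matches starts with '+'."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False  # assumption: a URL that matches no rule is skipped

if __name__ == "__main__":
    for url in ("http://www.example.com/new-page.html",
                "http://en.wikipedia.org/wiki/Nutch"):
        print(url, accepts(url))
```

If the off-domain URL comes back accepted with your real patterns, that would be consistent with the recrawl wandering off to wikipedia.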
