I can successfully retrieve the page source from a segment with this command:

bin/nutch readseg -dump crawl_folder/segments/segment_folder_name/ extract_folder_name -nofetch -nogenerate -noparse -noparsedata -noparsetext

(I don't know how to include all segment folders in one run, so if you can tell me how, I would appreciate it.)

This returns the page source, but it still includes HTML tags such as:

  <div id="facebox" style="display:none;"> \
      <div class="popup"> \
        <table> \
          <tbody> \
            <tr> \

But I want only the text, for language detection.
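
To illustrate the kind of output I mean, here is a minimal sketch in Python that uses only the standard library's html.parser module. It is not Nutch code: the names TextExtractor and html_to_text are made up for this example, and it only shows tags being dropped while the visible text is kept.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: keeps text nodes, drops tags, scripts, and styles."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        # Entering <script> or <style>: ignore everything until it closes.
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML string as one space-joined line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    sample = '<div id="facebox" style="display:none;"><p>Hello <b>world</b></p></div>'
    print(html_to_text(sample))
```

The string returned by html_to_text is the sort of tag-free text I would like to pass to the language detector.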

How can I do that? Thank you.
