How to retrieve only the content text, not the HTML

cefurkan0 cefurkan0 Thu, 08 Apr 2010 13:47:55 -0700

I can successfully retrieve the page source from a segment with this command:

bin/nutch readseg -dump crawl_folder/segments/segment_folder_name/ extract_folder_name -nofetch -nogenerate -noparse -noparsedata -noparsetex

(I don't know how to include all the segment folders at once, so I would appreciate it if you could tell me.)


This returns the page source, but it still includes HTML tags such as:

  <div id="facebox" style="display:none;"> \
      <div class="popup"> \
        <table> \
          <tbody> \
            <tr> \


But I want only the text, for language detection.

How can I do that? Thanks.
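
To make the goal concrete, here is a rough sketch of the kind of post-processing I mean, not a confirmed Nutch recipe. readseg also accepts a -nocontent option, so dropping the parse-text exclusion and passing -nocontent instead may dump the text Nutch has already extracted. If only the raw HTML dump is available, the tags could be stripped afterwards. The Python below uses only the standard library's html.parser; the names TextExtractor and html_to_text are hypothetical, invented for this illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, dropping tags and the bodies of script/style."""

    SKIP = {"script", "style"}  # elements whose contents are not readable text

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._parts.append(data)

    def text(self):
        # Join fragments with a space, then collapse whitespace runs.
        # Note: this also puts a space where inline tags split a word.
        return " ".join(" ".join(self._parts).split())


def html_to_text(html):
    """Return the visible text of an HTML string as a single line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    sample = (
        '<div id="facebox" style="display:none;">'
        '<div class="popup"><p>Hello <b>world</b></p></div></div>'
    )
    print(html_to_text(sample))
```

The resulting plain text could then be passed to whatever language detector is in use. Joining fragments with spaces is a deliberate simplification that should be harmless for language detection.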
