I can successfully retrieve the page source from a segment with this command:
bin/nutch readseg -dump crawl_folder/segments/segment_folder_name/ extract_folder_name -nofetch -nogenerate -noparse -noparsedata -noparsetext
(I don't know how to include all the segment folders at once, so if you can tell me how, I'd appreciate it.)
This returns the page source, but it still includes HTML tags like:
<div id="facebox" style="display:none;"> \
<div class="popup"> \
<table> \
<tbody> \
<tr> \
But I want only the text, for language detection.
How can I do that? Thanks.
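
A possible post-processing step, sketched under assumptions: if the dumped HTML is read back as a string, the tags could be dropped with Python's standard html.parser before the text goes to a language detector. This is not a Nutch feature. The names TextOnly and strip_tags are hypothetical helpers made up for this illustration, and real pages may need more careful handling (encodings, hidden elements, boilerplate).

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Hypothetical helper: collects text nodes, skipping <script>/<style> bodies."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []        # visible text fragments, in document order
        self._skip_depth = 0   # > 0 while inside a <script> or <style> element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        text = data.strip()
        if self._skip_depth == 0 and text:
            self.parts.append(text)


def strip_tags(html):
    """Return the text content of an HTML string, fragments joined by spaces."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


if __name__ == "__main__":
    sample = (
        '<div id="facebox" style="display:none;"><div class="popup">'
        "<table><tbody><tr><td>Hello</td></tr></tbody></table></div></div>"
    )
    print(strip_tags(sample))
```

The resulting string could then be handed to whichever language detection library is in use.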
