[
https://issues.apache.org/jira/browse/ANY23-154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13661795#comment-13661795
]
Kunal P edited comment on ANY23-154 at 5/20/13 5:53 AM:
--------------------------------------------------------
Yes, I have gone through the suggestion given in ANY23-131. But with that
approach we may get redundant data, because nested specifications will also be
extracted as topLevelItemNodes.
So my suggestion is to remove the following snippet of code, which does the
same thing (removing nested specifications):
if (!isItemProp(itemScope)) {
    topLevelItemScopes.add(itemScope);
}
AFAIK, such ItemScopes will be removed by the getUnnestedNodes() method anyway,
so we can reduce overhead.
Another important point is that we would then get all itemscopes that are
actually at the top level. Am I correct?
In our use cases, a few of the itemscopes are eliminated by this snippet
because of the following HTML structure:
<div id="someid" itemprop="offer" itemscope
itemtype="http://schema.org/Offer">
</div>
Thanks.
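To make the difference concrete, here is a minimal, self-contained sketch of the two filtering strategies discussed above. It does not use Any23 at all: the class name, method names and sample markup are hypothetical, the input is assumed to be well-formed XHTML (hence itemscope=""), and the "ancestor" filter is only an approximation of what a nesting-based check might do, not the actual getUnnestedNodes() implementation.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration only; none of these names come from Any23.
final class TopLevelItemScopeSketch {

    // "orphan" has itemprop + itemscope but no enclosing itemscope,
    // "nested" has itemprop + itemscope inside "product".
    static final String SAMPLE =
        "<body>"
        + "<div id=\"orphan\" itemprop=\"offer\" itemscope=\"\""
        + " itemtype=\"http://schema.org/Offer\"/>"
        + "<div id=\"product\" itemscope=\"\""
        + " itemtype=\"http://schema.org/Product\">"
        + "<div id=\"nested\" itemprop=\"offers\" itemscope=\"\""
        + " itemtype=\"http://schema.org/Offer\"/>"
        + "</div>"
        + "</body>";

    // Parses well-formed XHTML; checked exceptions are wrapped for brevity.
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }

    // All elements carrying an itemscope attribute, in document order.
    static List<Element> itemScopes(Document doc) {
        List<Element> result = new ArrayList<>();
        NodeList all = doc.getElementsByTagName("*");
        for (int i = 0; i < all.getLength(); i++) {
            Element e = (Element) all.item(i);
            if (e.hasAttribute("itemscope")) {
                result.add(e);
            }
        }
        return result;
    }

    // True if some ancestor element carries an itemscope attribute.
    static boolean hasItemScopeAncestor(Element e) {
        for (Node p = e.getParentNode(); p instanceof Element; p = p.getParentNode()) {
            if (((Element) p).hasAttribute("itemscope")) {
                return true;
            }
        }
        return false;
    }

    // Mirrors the snippet under discussion: drop every itemscope that has itemprop.
    static List<String> idsKeptByItemPropFilter(Document doc) {
        List<String> ids = new ArrayList<>();
        for (Element e : itemScopes(doc)) {
            if (!e.hasAttribute("itemprop")) {
                ids.add(e.getAttribute("id"));
            }
        }
        return ids;
    }

    // Alternative: drop only itemscopes that sit inside another itemscope.
    static List<String> idsKeptByAncestorFilter(Document doc) {
        List<String> ids = new ArrayList<>();
        for (Element e : itemScopes(doc)) {
            if (!hasItemScopeAncestor(e)) {
                ids.add(e.getAttribute("id"));
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE);
        System.out.println(idsKeptByItemPropFilter(doc)); // [product]
        System.out.println(idsKeptByAncestorFilter(doc)); // [orphan, product]
    }
}
```

In this sketch the itemprop-based filter loses the "orphan" element, which has the same shape as the div in our test page, while the nesting-based filter keeps it and still drops the genuinely nested item.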
> Not able to extract microdata in few test cases
> -----------------------------------------------
>
> Key: ANY23-154
> URL: https://issues.apache.org/jira/browse/ANY23-154
> Project: Apache Any23
> Issue Type: Bug
> Components: core
> Affects Versions: 0.7.0
> Environment: Windows 7 32bit
> JDK 1.6.0_38
> Intel Core 2 duo and 4GB RAM
> Reporter: Kunal P
> Fix For: 0.9.0
>
> Attachments: neeraj.nowfloats.com.htm, XOYRVIbK.part
>
>
> We are using the Apache Any23 API to extract microdata from a given web page
> as part of an internal project.
> We have some test cases where the API is not able to parse the microdata, e.g.
> www.neeraj.nowfloats.com (the web page does not strictly follow schema.org
> standards).
> Here is a snippet of the HTML code:
> <div id="someid" itemprop="offer" itemscope
> itemtype="http://schema.org/Offer">
> <div ... ></div>
> </div>
> The markup suggests that this item is a child of some parent microdata
> specification, since the same tag contains both itemscope and itemprop.
> However, the given <div id="someid"> tag has no parent microdata specification.
> The method used for extracting ItemScopes is as follows:
> import org.w3c.dom.Document;
> import org.apache.any23.extractor.microdata.ItemScope;
> import org.apache.any23.extractor.microdata.MicrodataParser;
> import org.apache.any23.extractor.microdata.MicrodataParserReport;
> // getDomDocument(String html): helper (not shown) that parses the HTML into a DOM
> Document dom = getDomDocument(html);
> MicrodataParserReport report = MicrodataParser.getMicrodata(dom);
> ItemScope[] items = report.getDetectedItemScopes();
> Here, items does not contain any ItemScope for the test case above.
> In such a scenario, how can we extract microdata from the page using the Any23
> API? Is there any way to relax the criterion that itemprop and itemscope must
> not appear in the same tag, so that we get the data from the web page?